Abstract


The  interpretation  system  presented  in  this  paper  uses 
multiple  image  resolutions  to  detect  objects  in  large,  cluttered, 
high  resolution,  aerial  images.  Lower  resolutions  are  processed 
first  so  attention  is  focused  in  the  higher  resolutions  to  those 
areas  where  there  is  evidence  of  the  objects.  The  system  provides 
a  framework  for  modeling  objects  at  each  resolution  using  a 
variety  of  complementary  image  features.  The  approach  is  both 
data-driven  and  model-driven,  utilizes  hypothesis  generation  and 
verification,  and  employs  evidential  reasoning  to  evaluate  the 
hypotheses. The system has been exercised on two aerial images:
one consisting of a single submarine and analyzed using
three  resolutions,  and  the  other  consisting  of  three  airplanes  and 
analyzed  using  two  resolutions. 

1.  INTRODUCTION 

The  interpretation  of  a  large,  cluttered,  high  resolution 
image  can  be  facilitated  by  examining  features  extracted  from 
multiple  resolutions  of  the  image.  The  objects  of  interest  are 
modeled  at  each  resolution  in  terms  of  features  that  can  be  used  to 
provide  evidence  for  the  objects.  By  examining  lower  resolutions 
during  the  initial  stages  of  image  interpretation,  object 
hypotheses  can  be  made  based  on  large,  prominent  features 
without  the  complications  of  clutter  and  object  detail  present  at 
higher resolutions. Given these initial hypotheses, higher resolutions
are examined only in those areas in which objects of
interest  are  expected.  By  focusing  in  on  the  objects,  a  more 
efficient  search  of  the  cluttered,  high  resolution  images  is 
achieved. 

In the system described in this paper, an object is modeled
according to its appearance in the image using two kinds of
features: salient features that create initial object hypotheses, and
supporting features that provide evidence for the hypothesized
object. At each resolution, hypotheses can result from two types
of processing. First, a hypothesis may result from a feature that
is due to an instance of the object. Second, a hypothesis may
result  from  objects  that  exist  at  a  lower  resolution  (e.g.,  an 
object  projects  itself  down  to  a  higher  resolution)  or  at  the  same 
resolution  (e.g.,  an  object  hypothesizes  another  within  the  same 
resolution).  In  either  case,  the  object  creates  hypotheses  in 
accordance  with  the  model  that  specifies  the  confirming  evidence. 

The  process  by  which  areas  in  multiresolution  images  are 
interpreted  as  scene  objects  can  be  broken  into  several  modules. 
Interactions  between  the  resolutions  during  interpretation, 
communication  and  feedback  between  selected  modules,  and 
infusion of scene knowledge into appropriate modules have the
effect of improving the robustness of the interpretation process.
In  a  simple  example  using  a  single  resolution  there  are  two 
modules:  the  Symbolic  Description  Module  and  the  Interpretation 
Module.  The  Symbolic  Description  Module  transforms  the 
original  image  into  symbolic  descriptions.  Any  of  the 
algorithms  can  use  ancillary  information  to  refine  parameter 
values.  For  example,  the  image  resolution  and  range  to  an  object 
will  affect  the  size  of  the  object  in  the  image;  this  information 
can  be  used  to  guide  a  procedure  that  looks  for  lines  that  are  the 
length  of  the  object.  The  Interpretation  Module  applies  ancillary 
information  and  the  object  models  to  the  symbolic  descriptions  to 
provide  a  consistent  description  of  the  scene.  As  a  resolution  is 
analyzed,  interpretations  from  previously  processed  resolutions 
are  made  available,  and  queries  of  other  resolutions  can  be  made. 
The  two  modules  interact  until  the  best  interpretation  is  found  for 
that  resolution.  (This  description  is  similar  to  a  context 
dependent  target  recognition  system  described  in  Silberberg  [10].) 

Among  the  issues  to  be  addressed  in  the  development  of  a 
multiresolution  image  interpretation  system  are  the  effective 
transformation  of  image  data  to  symbolic  descriptions,  the 
modeling  of  the  objects  and  their  parts,  the  representation  of 
symbolic  and  ancillary  information,  and  the  reasoning  used  to 
apply  the  knowledge  at  the  appropriate  resolutions  and  to  guide 
hypothesis  interaction  within  and  between  resolutions.  The 
image  interpretation  system  does  not  follow  an  algorithmic 
approach;  that  is,  it  does  not  consist  of  a  specific  ordering  of 
procedures.  Instead,  the  system  dynamically  chooses  procedures 
based  on  its  current  state.  The  approach  is  both  data-driven  and 
model-driven,  utilizes  hypothesis  generation  and  verification,  and 
employs  evidential  reasoning  to  evaluate  the  hypotheses.  The 
system has been exercised on two aerial images: one consisting
of  a  single  submarine  and  analyzed  using  three  resolutions,  and 
the  other  consisting  of  three  airplanes  and  analyzed  using  two 
resolutions. 

Several  single  resolution  image  interpretation  systems 
have  been  developed.  See,  for  example,  McKeown  et  al.  [6], 
Brooks  [2],  Hanson  and  Riseman  [4],  and  the  survey  by  Binford 
[1].  A  system  for  object  recognition  using  model  and  image 
pyramids is described by Neveu et al. [7]. In the distributed
system  of  Tan  and  Martin  [11],  a  multiresolution  pyramid  is  used 
to  detect  and  track  moving  objects. 

In Section 2, we describe the data structure used to represent
the initial symbolic descriptions and the resulting hypothesized
scene objects. Section 3 presents a description of the
multiresolution  image  interpretation  system.  The  performance  of 
the  system  on  two  images  follows  in  Section  4.  Conclusions 
and  future  work  are  discussed  in  Section  5. 


2. MULTIRESOLUTION SYMBOLIC PIXEL ARRAY

An  effective  structure  for  representing  image-based  data 
must  support  operations  that  retrieve:  (1)  pixel  properties,  such 
as  intensity  and  edge  magnitude,  and  the  object  hypotheses  that 
exist  at  a  pixel;  (2)  object  properties,  such  as  area  and 
compactness  for  regions,  or  length  and  orientation  for  lines;  and 
(3)  object  spatial  relationships,  such  as  adjacency  and  nearness. 
Furthermore,  retrieval  of  resolution-specific  data  must  be 
supported.  We  use  a  data  structure  called  the  Multiresolution 
Symbolic  Pixel  Array  (MSPA)  which  organizes  the  images  and 
features  by  resolution  and  allows  efficient  retrieval  of  pixel  and 
object  properties  and  object  spatial  relationships  within  and 
between  resolutions. 

The  MSPA  is  an  extension  of  the  single  resolution 
Symbolic  Pixel  Array  (Payton  [8]).  Each  pixel  property  is 
stored  as  a  full-sized  array,  and  object  extent  in  the  image  is 
stored  as  a  binary  array  with  x  and  y  offsets.  The  binary  array  and 
its  offsets  are  referred  to  as  a  virtual  array.  In  the  MSPA,  pixel 
properties  are  retrieved  by  indexing  into  the  appropriate  full-sized 
array  at  a  specified  resolution.  Object  properties  are  computed  as 
they  are  needed  and  then  stored  explicitly  with  the  hypothesized 
object.  A  list  of  objects  that  exist  at  a  given  location  in  a 
specified  level  is  determined  by  referencing  each  virtual  array  at 
that  location.  Objects  that  have  been  hypothesized  in  the  same 
location  but  at  a  different  resolution  can  be  found  by  transforming 
the  location  coordinates  appropriately. 
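
The virtual-array idea can be illustrated with a minimal Python sketch. It stores an object's extent as a binary array with x and y offsets, lists the objects present at a location, and converts coordinates between levels. The class and function names, the list-of-lists representation, and the example object are assumptions made for illustration; the factor of two between adjacent levels is taken from the 2x2 averaging described in Section 3. The paper does not give the actual MSPA implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VirtualArray:
    """Binary extent of one hypothesized object plus its image offsets."""
    name: str
    level: int             # pyramid level; 0 is the highest resolution
    x_off: int
    y_off: int
    mask: List[List[int]]  # mask[row][col] is 1 where the object covers a pixel

    def covers(self, x: int, y: int) -> bool:
        row, col = y - self.y_off, x - self.x_off
        if 0 <= row < len(self.mask) and 0 <= col < len(self.mask[row]):
            return self.mask[row][col] == 1
        return False


def transform(x, y, src_level, dst_level):
    """Map pixel coordinates between levels (assumed factor of 2 per level)."""
    if dst_level >= src_level:           # toward lower resolution
        s = 2 ** (dst_level - src_level)
        return x // s, y // s
    s = 2 ** (src_level - dst_level)     # toward higher resolution
    return x * s, y * s                  # top-left pixel of the block


def objects_at(objs, x, y, level, query_level=None):
    """Names of objects at `level` covering (x, y) given in `query_level` coordinates."""
    if query_level is None:
        query_level = level
    lx, ly = transform(x, y, query_level, level)
    return [o.name for o in objs if o.level == level and o.covers(lx, ly)]


# Hypothetical shadow region hypothesized at level 1.
shadow = VirtualArray("shadow-1", 1, x_off=4, y_off=2, mask=[[1, 1], [0, 1]])
print(objects_at([shadow], 5, 3, level=1))                  # location given at level 1
print(objects_at([shadow], 10, 6, level=1, query_level=0))  # same location given at level 0
```

Both queries refer to the same scene location, so both report the shadow hypothesis; only the coordinate frame of the query differs.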


By performing logical operations on the virtual arrays, we
can compute spatial relationships quickly. Such spatial relationships
include: find-distance-from-object-to-point,
compute-distance-between-objects, find-objects-on-object, and
is-object-near-point. Since the system is object-oriented, operations take
the form of messages passed to the MSPA. Appropriate
arguments to these spatial operations include one or more
objects, x and y coordinates, allowed object types, and a list of
one or more resolutions. As an example, the arguments for
find-objects-on-object are: the base object, a list of resolutions to be
examined, specific objects to ignore, allowed object types, and
disallowed object types. If the list of resolutions is empty, then
all resolutions are checked. If the base object does not exist at a
specified resolution, then an object representing the base object is
created temporarily. As a second example, the arguments for
compute-distance-between-objects are: object1, object2, and the
resolution, R, in which the distance is interpreted. The distance
is computed at the nearest resolution to R in which at least one
of the objects exists. If the other object does not exist at that
resolution, it is created temporarily. The default for R is the
highest resolution, and the distance must be returned in terms of
R.

3. MULTIRESOLUTION INTERPRETATION

This section provides a description of the multiresolution
image interpretation system. First, object interpretation using
evidence gathering and evaluation is summarized. A more
complete description can be found in Silberberg [10]. Next, the
organization of the objects in an interpretation structure and the
use of this structure to guide the interpretation of a single image
is presented. Finally, multiresolution interpretation and
hypothesis interaction are described.

Evidence gathering and evaluation

The interpretation approach is object-oriented; that is,
each symbolic description is initially hypothesized or instantiated
as one of the scene objects. The instantiated object then gathers
information which provides evidence for or against the hypothesis.
The scene knowledge, represented as object models and
associated rules, directs the gathering of the evidence. If enough
evidence is found, the hypothesis then becomes established, and
the degree to which all of the evidence confirms or disconfirms
the hypothesis is computed.

Each object model is characterized by attributes and desirable
relationships between these attributes. The presence or absence
of each attribute provides evidence that confirms or disconfirms
an instantiation of that object. Attributes and their
relationships are represented in the model using slots and rules
that specify restrictions and relationships between these slots.
The properties of each slot specify the type and number of the
objects that can fill the slot as well as methods for filling the
slot. A slot can be filled by finding an existing object or by
hypothesizing a new object. The rules represent a collection of
information that determines if a hypothesized symbolic description
should be instantiated as a particular scene object and
how to combine the evidence to compute a confidence. The
evaluation and combination of uncertain evidence follows the
method set forth in Mycin [3]; that is, we maintain an overall
confidence measure computed from a measure of belief and a
measure of disbelief.
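
The Mycin-style confidence measure mentioned above can be sketched in a few lines of Python. The sketch uses the original Mycin definitions: each new piece of evidence is folded into a measure of belief (MB) or disbelief (MD) incrementally, and the overall confidence is CF = MB - MD. The function names and the example evidence strengths are illustrative assumptions; the rules and weights actually used by the system are not given in the text.

```python
def combine(current: float, new: float) -> float:
    """Fold one more piece of evidence (strength in [0, 1]) into a measure."""
    return current + new * (1.0 - current)


def confidence(confirming, disconfirming) -> float:
    """Overall confidence CF = MB - MD, which lies in [-1, 1]."""
    mb = 0.0
    for strength in confirming:
        mb = combine(mb, strength)
    md = 0.0
    for strength in disconfirming:
        md = combine(md, strength)
    return mb - md


# Hypothetical evidence for one object hypothesis: two confirming
# attributes and one disconfirming attribute.
cf = confidence(confirming=[0.6, 0.5], disconfirming=[0.3])
print(round(cf, 2))
```

Because each measure saturates toward 1, additional weak evidence never pushes a measure out of range, and the order in which evidence arrives does not change the result.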


Interpretation  of  a  single  resolution 

During interpretation, it is desirable to withhold commitment
to an interpretation if there is insufficient evidence so
that complex analysis of an incorrect interpretation is avoided.
Additionally, rather than attempting interpretation of a symbolic
description directly as an expected object, it is often more appropriate
to gather pieces of evidence which can then hypothesize
that object. A general-to-specific hierarchy can be utilized in
realizing the former objective; a network that specifies support
can be helpful in realizing the latter.

The  structure  used  in  representing  the  scene  knowledge  is 
a  semantic  network  which  is  a  directed  graph  consisting  of  a  set 
of  vertices  and  a  set  of  labeled  edges  between  pairs  of  vertices.  In 
this  representation,  a  vertex  is  an  object  type,  and  the  label  on  an 
edge represents either an "a-kind-of" or an "in-support-of"
relationship. Figure 1 shows the representation of objects in the
submarine example. The solid edges represent the "a-kind-of"
relationship ("shadow is a-kind-of region-scene-object"). Broken
edges  represent  possible  object  dependencies  where  the  object  at 
the  tail  of  the  edge  provides  support  for  the  object  at  the  head 
("shadow  is  in-support-of  submarine"). 

Initially,  each  of  the  original  symbolic  descriptions  is 
hypothesized  as  the  most  general  object  class  (scene-object  in 
Figure  1).  For  each  hypothesis,  object  slots  are  filled  by  known 
objects  or  hypothesized  new  objects.  In  the  event  that  one  or 
more  objects  hypothesize  another  object,  the  in-support-of 
relationships  are  utilized.  Objects  that  are  in  support  of  another 
can  singly  or  jointly  hypothesize  that  other  object. 
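
A minimal Python sketch of such a labeled, directed network follows. Only the two "shadow" edges are quoted from the text; the edge from region-scene-object to scene-object is an assumption based on scene-object being the most general class, and the function names are illustrative rather than the system's own.

```python
from collections import defaultdict

# Labeled, directed edges written as (tail, label, head).
EDGES = [
    ("region-scene-object", "a-kind-of", "scene-object"),  # assumed edge
    ("shadow", "a-kind-of", "region-scene-object"),        # quoted in the text
    ("shadow", "in-support-of", "submarine"),              # quoted in the text
]


def build(edges):
    """Index the edges by (tail, label) for quick lookup."""
    net = defaultdict(list)
    for tail, label, head in edges:
        net[(tail, label)].append(head)
    return net


def ancestors(net, obj_type):
    """Follow a-kind-of edges from obj_type toward the most general class."""
    chain, frontier = [], [obj_type]
    while frontier:
        node = frontier.pop()
        for parent in net.get((node, "a-kind-of"), []):
            chain.append(parent)
            frontier.append(parent)
    return chain


def can_hypothesize(net, obj_type):
    """Object types that an instance of obj_type may hypothesize."""
    return list(net.get((obj_type, "in-support-of"), []))


net = build(EDGES)
print(ancestors(net, "shadow"))
print(can_hypothesize(net, "shadow"))
```

Keeping the two edge labels in one structure lets the same lookup serve both the general-to-specific classification and the support relationships.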




Figure 1. The semantic network representation for a submarine
image. (The network is similar to that used in the
experiments.) 


When  reasonable  evidence  has  been  supplied  to  all  of  the 
slots  of  a  hypothesized  object,  that  object  becomes  established, 
and  slot  values  that  yield  the  highest  confidence  for  the  object  are 
chosen  from  the  possibilities.  For  each  established  hypothesis,  a 
subclassification  following  the  "a-kind-of"  edges  backwards 
proceeds  unless  a  base  class  has  been  reached.  Once  all 
unreasonable  hypotheses  have  been  rejected,  the  interpretation 
process  begins  again.  Interpretation  continues  until  no  new 
hypothesis  is  generated. 

Before the final interpretation of an object can be determined,
the confidence computed for a specific class must be
spread throughout the whole "a-kind-of" hierarchy. In other
words,  confirming  evidence  for  one  class  should  have  the  effect  of 
confirming  each  class  in  the  path  from  the  root  to  that  class  and 
disconfirming  every  other  class.  Similarly,  disconfirming 
evidence  for  one  class  should  also  disconfirm  every  subclass  of 
that  class  and  confirm  every  other  class.  The  final  interpretation 
of  an  object  results  in  that  class  with  the  strongest  global 
evidence  provided  that  the  evidence  is  above  some  preset 
threshold.  The  process  developed  for  this  system  can  be  found  in 
Kim [5].
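
The direction in which evidence spreads through the hierarchy can be illustrated with the following Python sketch. It only partitions the classes into those that are confirmed and those that are disconfirmed; how the confidence values themselves are updated is described in Kim [5] and is not reproduced here. The example hierarchy, including the class names line-scene-object, submarine-area, and glint, is hypothetical.

```python
# child -> parent along "a-kind-of" edges (hypothetical example hierarchy)
PARENT = {
    "region-scene-object": "scene-object",
    "line-scene-object": "scene-object",
    "shadow": "region-scene-object",
    "submarine-area": "region-scene-object",
    "glint": "line-scene-object",
}


def path_from_root(cls):
    """Classes on the path from the root down to cls, inclusive."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def all_classes():
    return set(PARENT) | set(PARENT.values())


def subtree(cls):
    """cls together with every subclass of cls."""
    return {c for c in all_classes() if cls in path_from_root(c)}


def spread(cls, confirming):
    """Return (confirmed, disconfirmed) class sets for evidence about cls."""
    everything = all_classes()
    if confirming:
        confirmed = set(path_from_root(cls))
    else:
        confirmed = everything - subtree(cls)
    return confirmed, everything - confirmed


confirmed, disconfirmed = spread("shadow", confirming=True)
print(sorted(confirmed))
print(sorted(disconfirmed))
```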

Multiresolution  Interpretation 

In  the  current  system,  each  lower  resolution  is  computed 
by  averaging  the  gray  levels  of  2X2  blocks  of  pixels  at  the  next 
higher  resolution.  Since  this  construction  resembles  a  tapering 
pyramid,  we  refer  to  higher  (lower)  resolutions  as  being  lower 
(higher)  levels  where  the  highest  resolution  is  level  zero.  Each 
level  is  processed  separately  and  has  the  effect  of  creating 
hypotheses  at  the  next  lower  level  (higher  resolution).  This 
process  for  three  levels  is  shown  in  Figure  2  in  which  heavy 
arrows  represent  flow  of  control  and  light  arrows  represent  data 
flow.  Each  level  has  as  input  the  images  for  that  and  the  next 
lower  level,  object  models  specific  to  the  resolution  at  that  level, 
and  hypotheses  for  that  level  made  by  the  previously  interpreted 
level. 
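
The construction of the levels can be sketched in Python as follows. Each level is obtained by averaging 2x2 blocks of the level below it, with level 0 being the original image. Representing the image as a list of rows, the function names, and dropping a trailing odd row or column are assumptions made for this example.

```python
def reduce_level(image):
    """Average each 2x2 block of gray levels to form the next higher level."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def build_pyramid(image, num_levels):
    """Return [level 0, level 1, ...]; level 0 is the highest resolution."""
    levels = [image]
    for _ in range(num_levels - 1):
        levels.append(reduce_level(levels[-1]))
    return levels


image = [
    [0, 4, 8, 8],
    [4, 8, 8, 8],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
]
pyramid = build_pyramid(image, 3)
print(pyramid[1])  # 2x2 level
print(pyramid[2])  # 1x1 level
```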


Figure 2. The interpretation process for 3 levels. Interpretation
begins at level 2. Hypotheses are filtered down to the
lower levels. The more reliable interpretations are found in
the lower levels.

The  appearance  of  each  object  at  each  level  is  modeled  in 
terms  of  features  extracted  from  the  image  taking  into  account  the 
parts of the object and the other objects that provide support.
Thus,  for  example,  an  airplane  is  modeled  in  terms  of  regions  of 
the  appropriate  size  and  shape  that  represent  fuselage  and  wings 
(parts)  and  shadow  (support).  The  object  model  at  various  levels 
will  change  as  features  become  more  or  less  prominent.  For 
example,  airplane  engines  are  included  only  in  the  lower  level 
models. 

Interpretation  begins  at  the  highest  level  for  which  the 
existence  of  the  objects  can  be  inferred.  This  inference  can  be 
broad,  as  in  "Examine  only  these  areas  for  the  objects,"  or 
specific,  as  in  "There  is  specific  evidence  for  the  existence  of  an 
object  here."  Examination  at  lower  levels  where  more  object 
detail  is  present  is  guided  by  previous  interpretations.  This  is  an 
important  feature  since  without  the  focusing  of  attention 
mechanism,  larger  portions  of  the  cluttered,  high  resolution 
images  would  be  examined.  Such  examination  would  waste 
computational  resources  and  could  result  in  more  false  alarms. 
Although  interpretations  from  all  the  levels  can  be  examined, 
those  resulting  at  the  lower  levels  will  be  more  reliable. 
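
The focusing of attention can be illustrated with a short Python sketch that projects the extent of a hypothesis down one level. It assumes that a hypothesis is summarized by an inclusive bounding box (x0, y0, x1, y1) and that a small margin is added around the projected area; both are assumptions for this example, as are the function names and the example numbers. The factor of two follows from the 2x2 averaging used to build the levels.

```python
def project_down(box, margin=0):
    """Map an inclusive box at level k to the covering box at level k-1."""
    x0, y0, x1, y1 = box
    return (2 * x0 - margin, 2 * y0 - margin,
            2 * x1 + 1 + margin, 2 * y1 + 1 + margin)


def clip(box, width, height):
    """Keep the box inside an image of the given size."""
    x0, y0, x1, y1 = box
    return (max(0, x0), max(0, y0), min(width - 1, x1), min(height - 1, y1))


def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0 + 1) * (y1 - y0 + 1)


# Hypothetical hypothesis found at level 1 of a 256x256 image; the
# corresponding level 0 image is then 512x512.
level1_box = (100, 40, 120, 60)
level0_box = clip(project_down(level1_box, margin=4), 512, 512)
print(level0_box)
print(area(level0_box) / (512 * 512))  # fraction of level 0 that is examined
```

Only the projected area needs to be searched at the lower level, which is where the savings in computation and false alarms come from.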

4.  EXPERIMENTS 

The  system  has  been  exercised  on  two  examples.  In  the 
first  example,  which  consists  of  an  image  containing  a  single 
submarine,  three  resolutions  are  processed  (levels  0,  1,  and  2). 
The  level  2  image  (128X128)  is  in  Figure  3.  Pairs  of  parallel 
lines a known distance apart are used to create initial submarine
hypotheses  at  levels  1  and  2.  Submarine  hypotheses  at  levels  0 
and  1  are  made  only  by  hypotheses  at  levels  1  and  2,  respectively. 
At levels 1 and 2, evidence for a submarine consists of regions
representing submarine area and shadow, lines representing glint,
and evidence of a submarine at the next lower level. At the
lowest level, submarine support consists of regions representing
submarine area, shadow, and tail. Since the submarine tail has
small area and may not be visible at high levels, it is used as
support at level 0 only.

After applying the Canny edge operator to the original
image, two-connected contours (each pixel has at most two
neighbors) are extracted and broken at points of high curvature.
Line segments are then extracted by fitting lines to the broken
contours. Pairs of parallel lines with a minimum length and a
specified distance apart are then determined. All of the lines
chosen as members of a parallel line pair for level 2 are shown in
Figure 4. Regions are extracted using an edge-based segmentation
(Figure 5) (Perkins [9]). Line pairs and regions for levels 0 and 1
will not be shown. The resulting interpretation of level 2 is in
Figure 6a. The regions are both submarine area and shadow.
The lines are glint and submarine hypotheses. Two submarines
are hypothesized because at this level the lines delineating the
right side of the submarine and the left side of the dock are
relatively close together. In Figures 6b and 6c, the two
submarine hypotheses and their evidence are shown. In each of
these images, the outer lines created the submarine hypothesis,
and the middle line is the glint line.

The interpretation of the next higher resolution, level 1,
is shown in Figure 7a. Again, the regions are submarine area and
shadow, and the lines are both glint and submarine hypotheses.
At this level, the two hypotheses from level 2 are disambiguated,
resulting in only one submarine hypothesis. The submarine and
its evidence are shown in Figure 7b.

The last interpretation, level 0, is in Figure 8a. The
submarine hypothesis is made from the hypothesis at level 1.
Glint is not used as evidence at this level; the submarine tail
hypotheses are the light, compact regions. The final submarine
hypothesis and its evidence at level 0 are shown in Figure 8b.

In the second example, the image contains three airplanes
and is analyzed using two levels. The level 1 image (256X256)
is in Figure 9. Airplane models at both levels consist of the
fuselage, which is an elongated, high intensity region; one or
two wings, which are also elongated, high intensity regions;
and airplane shadows, which are elongated, low intensity
regions. At the lower resolution, level 1, the airplanes are
hypothesized using the fuselage. Support consists of
neighboring wings oriented at the correct angle and neighboring
shadow regions. Airplane hypotheses at level 1 are responsible
for creating hypotheses at level 0.

The regions for detecting airplanes are extracted using
morphological operators. By shrinking the bright regions in the
original image (replace each pixel by the minimum in its
neighborhood) and then growing them back (replace each pixel by
the maximum in its neighborhood), small, high intensity regions
are eliminated. These regions can be recovered by taking the
difference of this image and the original image. Small, low
intensity regions can be found in a similar way. Bright regions
and dark regions for levels 1 and 0 are shown in Figures 10 and
11, respectively. Given these regions, the images are interpreted,
starting with the higher level.

The interpretation of level 1 is in Figure 12a. The darker
regions represent both wing and fuselage hypotheses. The
fuselage hypotheses created the airplane hypotheses. The bright
regions are airplane shadows. Figure 12b shows each of the three
airplane hypotheses and their evidence where, due to reproduction,
only the shadow regions are clearly apparent. Each of the
airplane hypotheses at level 1 creates hypotheses at level 0.

The three airplanes found at level 0 are in Figure 13.
Since the areas to be analyzed are specified by the hypotheses of
level 1, very few regions are considered as fuselage, wing, and
airplane shadow. Again, due to reproduction, only the shadow
regions are clearly apparent.

In both of these examples, the original object hypotheses
are made at the higher levels. These hypotheses are then projected
to the lower levels. In the submarine example, the lowest
level model contains additional detail, namely, the tail. By
analyzing the lower levels, the system is able to disambiguate the
two original submarine hypotheses. In the airplane image, only
those areas of the lower level near the original hypotheses are
considered, thereby making the interpretation process more
efficient. Finally, by appropriately choosing feature extraction
algorithms, airplane and shadow areas are reliably extracted,
making interpretation of the poor quality image possible. If, on
the other hand, we had chosen the edge-based segmentation
technique used for the submarine image, the interpretation would
have been difficult, if not impossible.

5. CONCLUSIONS AND FUTURE WORK

The multiresolution image interpretation system described
in this section takes advantage of the reduction in image clutter
and object scale to efficiently and reliably detect objects of
interest. The system represents symbolic image descriptions and
scene knowledge efficiently and applies the scene knowledge to
the symbolic descriptions effectively. All symbolic descriptions,
including the hypothesized objects, are represented in the
Multiresolution Symbolic Pixel Array, which allows uniform
treatment of all image-based objects in the system. The system
incorporates a semantic network model representation that is
general enough to be easily extended to include additional objects.
The simple interpretation process adheres to the principle of least
commitment in two ways: (1) using the "a-kind-of" relations,
object hypotheses occur only if there exist supporting intrinsic
feature properties, and (2) final interpretations are not determined
until all hypotheses have been made. Finally, the system
incorporates both data and model-driven processing.

The system has been exercised on a submarine image and
an airplane image. Using the focus of attention mechanism
during interpretation, the object hypotheses are disambiguated
(e.g., the submarine image) and high resolution images are not
fully analyzed (e.g., airplane image). The submarine model is
augmented at the highest resolution with additional detail.

Several extensions to the system are possible. Additional
interactions between levels, that is, upward, downward, and
between more than just neighboring levels, will allow object
hypotheses from both lower and higher levels. This is important
when object supports (or parts) are found at a lower level without
the object structure being found at a higher level, or when
support does not appear in immediately adjacent levels.
Currently,  a  separate  object  hypothesis  is  instantiated  at  each 
level;  more  appropriate  is  the  creation  of  a  single  hypothesis  that 
represents  all  the  evidence  for  an  object  at  the  various  resolutions. 
Object  models  will  be  more  robust  with  the  explicit  inclusion  of 
such  ancillary  information  as  sun  angle,  range,  and  physical 
properties.  Automatic  model  generation  from  a  three- 
dimensional  model  and  ancillary  information  will  make  the 
interpretation  process  more  robust.  A  mathematically-based, 
multiresolution,  evidence  evaluation  scheme  still  needs  to  be 
investigated.  Finally,  experiments  with  additional  images  are 
clearly  necessary  in  order  to  evaluate  the  system  more  fully. 
This  includes  the  incorporation  of  robust  object  models  based  on 
a  wider  variety  of  image  features,  and  experiments  using  more 
than  three  levels  of  resolution. 
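
As a supplementary illustration of the morphological region extraction used for the airplane image in Section 4, the following Python sketch shrinks and regrows bright regions with minimum and maximum filters and then takes the difference from the original image. The 3x3 neighborhood, the border handling (neighborhoods are clipped at the image edge), the function names, and the tiny test image are assumptions made for this example; the paper does not state the neighborhood size it used.

```python
def neighborhood_filter(image, size, pick):
    """Replace each pixel by pick() over its size x size neighborhood."""
    h, w = len(image), len(image[0])
    r = size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def small_bright_regions(image, size=3):
    """Original minus (shrink then grow): keeps small, high intensity regions."""
    shrunk = neighborhood_filter(image, size, min)
    grown = neighborhood_filter(shrunk, size, max)
    return [[p - g for p, g in zip(prow, grow)]
            for prow, grow in zip(image, grown)]


def small_dark_regions(image, size=3):
    """(Grow then shrink) minus original: keeps small, low intensity regions."""
    grown = neighborhood_filter(image, size, max)
    shrunk = neighborhood_filter(grown, size, min)
    return [[s - p for p, s in zip(prow, srow)]
            for prow, srow in zip(image, shrunk)]


# Uniform background of 10 with one bright pixel and one dark pixel.
image = [[10] * 5 for _ in range(5)]
image[2][2] = 50
bright = small_bright_regions(image)
image[2][2] = 2
dark = small_dark_regions(image)
print(bright[2][2], dark[2][2])
```

Regions larger than the neighborhood survive the shrink-and-grow step and therefore cancel in the difference, which is why only small regions remain.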
Figure 3. Level 2 submarine image.

Figure 4. Level 2 parallel line pairs used to create submarine hypotheses.

Figure 5. Level 2 regions.

Figure 6. (a) Interpretation of level 2 including all submarine, submarine
area, shadow, and glint hypotheses. (b) First submarine hypothesis and
its evidence. (c) Second submarine hypothesis and its evidence.

Figure 7. (a) Interpretation of level 1. (b) Submarine hypothesis and evidence.

Figure 8. (a) Interpretation of level 0. (b) Submarine hypothesis and evidence.

Figure 9. Level 1 airplane image.

Figure 10. (a) Level 1 bright regions. (b) Level 1 dark regions.

Figure 11. (a) Level 0 bright regions. (b) Level 0 dark regions.

Figure 12. (a) Level 1 fuselage, wing, and shadow hypotheses. (b) Airplane
hypotheses and evidence.

Figure 13. Level 0 airplane hypotheses.

Abstract

Vision systems which rely on low-level representations
such as edges and regions have difficulty
handling complex natural images for high and mid
level visual tasks. Representations of structural
relationships in the arrangements of primitive image
features, as detected by the perceptual organization
process, are essential for analyzing complex
imagery. We term these representations collated
features. In the detection of collated features,
structural information is utilized to overcome local
problems such as noise and poor contrast of
real images. The structural information encoded
in collated features is essential for such processes
as visual reasoning, object segmentation and shape
description. Their influence is not limited to the
above mentioned "high-level" visual processes as
the collated features also actively interact with
such "mid-level" processes as stereo and serve as
attentional mechanisms for "low-level" feature detection.

We present a vision system that illustrates, for
a particular visual domain, the concept of collated
features, the process of their detection and
their interaction with the diverse visual processes
of stereo, monocular and 3D reasoning, shape description
and object extraction.

1  Introduction 

The ability to extract and describe distinct 3D objects
in a complex scene is crucial for an image understanding
system. The traditional approaches are edge-based
or region-based. In edge-based methods, local edges are
detected and then linked into contiguous curves. These
curves typically do not give complete boundaries for complex
objects and many curves that correspond to texture,
surface marking and noise are present. Many attempts
have been made to connect these curve segments, using
"contour tracing" methods, into meaningful objects.

Supported in part by DARPA, contract # F33615-87-C-1436, monitored
by the Air Force Wright Aeronautical Laboratories, DARPA
order # 3119, and in part by DMA contract # 800-85-C-0008.


Such techniques have been successful for relatively simple
scenes but fail in more complex environments. Region-based
methods do give closed regions, by construction,
but the regions often do not correspond to the objects in
complex environments. The key problem with these segmentation
techniques is their myopic nature. They operate
locally on the image intensities and do not utilize global
information. Thus, these systems fail when faced with
local problems such as poor contrast, noise, shadows and
occlusions.

One  alternative  to  dealing  with  these  fragmented  seg¬ 
mentations  is  to  use  “model-based”  techniques;  a  survey 
can  be  found  in  [1].  Model-based  techniques  usually  rely
on  a  priori  knowledge  of  the  objects  in  the  scene  and 
predict  the  appearance  of  an  object  to  the  low-level  de¬ 
scriptions  that  can  be  extracted  from  the  fragmented  seg¬ 
mentation.  While  impressive  results  can  be  obtained  for 
restricted  domains,  we  believe  that  this  approach  is  too 
restrictive  and  not  usable  for  complex  scenes  and  complex 
tasks. 

The  alternative  we  pursue  in  this  work  is  that  of  group¬ 
ing  the  fragmented  low-level  descriptions  into  higher-level 
groups.  Humans  are  very  good  at  such  perceptual  group¬ 
ing  tasks;  we  can  typically  see  shapes  in  arrangements  of 
dots  or  even  poor  machine  generated  edge  outputs  from 
even  very  complex  scenes.  We  believe  that  such  grouping 
must  also  be  an  essential  element  for  machine  vision  sys¬ 
tems  if  they  are  to  have  generality  and  be  able  to  handle 
models  that  are  not  given  very  specifically.  We  further 
believe  that  such  groupings  are  useful  for  high-level  pro¬ 
cesses  such  as  object  segmentation  and  shape  description 
and  are  likely  to  be  a  precursor  for  even  intermediate 
processes  like  stereo.  Stereo  is  sometimes  believed  to  give 
unambiguous  data  to  help  in  the  grouping  process,  but 
we  show  that  sometimes  the  reverse  may  be  more  appro¬ 
priate. 

In  section  2  we  discuss  the  role  of  perceptual  orga¬ 
nization  in  terms  of  the  representations  generated  by  it 
and  the  use  of  these  representations  by  other  visual  pro¬ 
cesses.  In  section  3  we  present  a  visual  domain  which  we 
will  be  using  to  illustrate  the  concepts  of  collated  features, 
their  detection  and  their  application  in  various  visual  processes.
In  section  4  we  present  the  technique  used  to  detect
collated  features.  All  reasonable  feature  groupings
are  initially  postulated  as  collated  features.  A  process  of 
mutual  cooperation  and  competition,  outlined  in  section 
5,  is  then  used  to  select  the  more  promising  collations. 
We  show  the  application  of  collated  features  for  stereo 
matching  in  section  6,  and  for  visual  reasoning,  object 
segmentation  and  shape  description  in  section  7. 

We  have  tested  the  vision  system  employing  collated 
features  on  several  stereo  pairs  of  images.  We  present 
some  collated  features  detected,  and  the  3D  descriptions 
generated  of  the  objects  located  in  some  of  these  images. 

2  Role  of  Perceptual  Organization 

It  has  been  shown  that  our  visual  system  can  immedi¬ 
ately  detect  such  feature  relationships  as  collinearity,  par¬ 
allelism,  connectivity  and  repetitive  patterns  when  shown 
an  otherwise  randomly  distributed  set  of  image  elements 
[2].  This  phenomenon  is  called  perceptual  organization 
and  has  been  studied  by  investigators  of  psychology  and 
computer  vision  [2-13].  In  spite  of  this  active  inquiry  into
the  phenomenon  of  perceptual  organization  and  attempts 
at  implementing  some  aspects  of  it,  little  systematic  use 
of  perceptual  organization  has  been  made  in  computer 
vision  systems.  The  reasons  for  this  lie  in  the  difficulty 
in  detecting  the  relationships  found  by  perceptual  organi¬ 
zations  and  in  designing  proper  representations  and  tech¬ 
niques  to  use  these  relationships  in  other  visual  processes. 

Psychophysical  experiments  show  the  preference  of  our 
visual  system  for  some  particular  relationships  among  im¬ 
age  features.  Collinearity,  connectedness,  completeness, 
symmetry,  proximity  and  repetition  are  among  the  re¬ 
lationships  preferred.  The  common  denominator  of  these 
relationships  is  their  structural  nature.  Thus,  the  primary 
purpose  of  perceptual  organization  is  to  make  salient  the 
structural  interrelationships  between  image  features.  We 
propose  that  perceptual  organization  takes  primitive  im¬ 
age  elements  typically  generated  by  low-level  segmenta¬ 
tion  processes  and  generates  representations  of  feature 
groupings  which  encode  the  structural  interrelationships 
between  the  component  elements.  We  term  these
representations  collated  features.  Collated  features  identify
those  structural  relationships  that  are  most  common
among  objects  of  our  visual  domain  and  remain  invari¬ 
ant  in  2D  projections  over  most  viewpoints  and  repre¬ 
sent  these  relationships  in  forms  suitable  for  other  vi¬ 
sual  processes.  In  other  words,  collated  features  identify 
structures  (among  arrangements  of  image  features)  that 
have  high  probability  of  corresponding  to  object  structures
(significance)  and  are  useful  for  other  visual  processes
(utility).


2.1  Collated  Features 

Primitive  image  features,  in  their  raw  form,  provide  insuf¬ 
ficient  information  (explicitly,  for  in  general  terms  all  vi¬ 
sual  information  is  embedded  in  these  features)  to  higher 
visual  processes.  The  structural  information  encoded  in 
collated  features  is  crucial  to  the  proper  functioning  of 
higher  visual  processes.  This  is  a  major  departure  from 
previous  vision  systems  which  rely  primarily  on  such  im¬ 
age  features  as  edges  and  regions.  The  representational 
power  of  primitive  image  features  is  very  low;  generic  de¬ 
scriptions  of  them  such  as  compact  regions,  blue  regions 
etc.  can  only  support  very  weak  inferences.  Moreover, 
such  descriptions  are  completely  unsuitable  for  deriving 
meaningful  descriptions  of  objects.  Some  of  the  deficiencies
of  primitive  image  elements  can  be  bypassed  by
model-based  vision  systems  which  match  precise  models 
of  particular  objects  directly  to  image  features.  Even  for 
such  systems,  it  has  been  found  that  some  preliminary 
grouping  of  image  features  is  useful  [3,2]. 

The  complex  nature  of  natural  images  brings  forward 
another  inadequacy  of  primitive  image  features  as  rep¬ 
resentations  useful  for  higher  visual  processes.  In  real 
images  the  problems  of  noise,  poor  contrast  and  occlusion
often  result  in  poor  edge  contours  and  region  segments.
Thus  some  vision  systems  which  rely  primarily  on  primi¬ 
tive  image  features,  and  are  able  to  demonstrate  reason¬ 
able  performances  on  simple  scenes  imaged  under  ideal 
situations,  fail  miserably  when  faced  with  natural  images.
On  the  other  hand,  for  detecting  collated  features
we  use  structural  information  from  various  parts  of  the
scene,  which  results  in  more  robust  detection.  This  indicates
that  not  only  are  the  collated  features  generated  by
perceptual  organization  useful  but  crucial  to  higher  level 
visual  processing. 

Collated  features  are  groupings  of  image  elements  struc¬ 
turally  relevant  to  objects  in  the  visual  domain.  They 
represent  groupings  of  image  features  which  have  a  high 
probability  of  corresponding  to  object  features.  Thus  col¬ 
lated  features  “mark”  the  image  features  relevant  to  ob¬ 
ject  extraction.  Collated  features  identify  structural  re¬ 
lationships  among  image  features  which  also  exist  among 
object  features.  Furthermore,  collated  features  collect  im¬ 
age  features  that  probably  correspond  to  the  same  object 
(or  part). 

Object  shapes  are  described  in  terms  of  component 
shapes.  This  decomposition  of  shape  description  follows 
a  structural  basis  where  individual  shape  components  are 
“well  formed”  and  have  “simpler”  individual  descriptions 
than  that  of  the  combined  shape.  The  collation  of  features 
identifies  such  individual  structures  that  are  useful  in  de¬ 
scribing  overall  shapes.  Collated  features  also  form  small 
local  structural  descriptions.  These  can  be  combined  to 
give  global  object  descriptions,  or  be  transformed  to  other 
descriptions  more  suitable  for  a  particular  visual  process. 


Experiments  by  Julesz  [4]  and  various  stereo  algo¬ 
rithms  [16-18]  may  have  portrayed  stereo  as  an  early  stage 
processing  which  is  performed  prior  to  any  feature  group¬ 
ing.  However,  the  difficulty  of  fusing  random  dot  stere¬ 
ograms,  given  the  absence  of  ambiguity  in  matching  such 
patterns  (since  random  patterns  are  relatively  free  from 
repetitions)  in  contrast  to  the  ease  of  fusing  real  images, 
indicates  that  the  stereo  system  does  use  more  evolved 
representations  in  addition  to  primitive  ones.  Supporting 
evidence  for  this  comes  from  new  stereo  systems  [19-22]
which  have  demonstrated  the  use  of  feature  groupings  in 
aiding  stereo  matching.  Collated  features  are  more  struc¬ 
turally  evolved  representations  than  edges,  thus  there  is 
much  less  ambiguity  in  matching  collated  features  than  there  is
in  matching  primitive  features.  We  believe  that  match¬ 
ing  between  collated  features  generates  a  rough  disparity 
map  by  indicating  match  between  feature  pools  and  this 
can  guide  matching  between  more  precisely  located  im¬ 
age  features  such  as  edges.  While  collated  features  can 
in  a  similar  fashion  help  other  visual  matching  tasks  such 
as  large  time  interval  motion  sequences  or  matching  vi¬ 
sual  representations  (such  as  maps)  to  images;  we  will  not 
discuss  them  here. 

By  the  identification  of  significant  features,  collated 
features  can  act  as  attentional  mechanisms  for  detailed 
visual  inspection,  guiding  the  detection  of  features  at  in¬ 
teresting  locations  in  greater  detail. 

3  A  Visual  Domain 

We  believe  that  the  representations  formed  by  percep¬ 
tual  organization  are  crucial  for  high  level  visual  tasks. 
To  demonstrate  our  view,  we  have  chosen  a  domain  of 
natural  imagery  sufficiently  complex  to  be  intractable  by 
techniques  primarily  dependent  on  low-level  representa¬ 
tions  such  as  edges  and  regions,  but  with  the  viewpoints 
and  the  shape  of  the  objects  so  restricted  that  we  can  get  a 
simple,  unambiguous  list  of  structural  relationships  to  be 
encoded  into  collated  features  by  the  perceptual  organiza¬ 
tion  process.  This  system  will  illustrate  the  methodology 
for  the  detection  and  application  of  collated  features  and 
would  serve  as  a  framework  for  a  more  general  system. 

The  domain  of  the  vision  system  CANC  [5]  consists  of 
aerial  images  of  suburban  scenes.  The  objects  of  interest 
are  buildings.  The  vision  system  works  by  first  detect¬ 
ing  collated  features  in  the  images  and  then  employing 
these  collated  features  in  various  visual  processes  to  generate
descriptions  and  a  three  dimensional  model  of  the
buildings  in  the  scene. 

3.1  Previous  Work 

A  lot  of  work  has  been  done  in  computer  vision  on  detecting
buildings  and  other  man-made  structures  in  aerial
images.  While  a  wide  variety  of  techniques  have  been
applied  towards  this  task,  a  systematic  use  of  perceptual 
grouping  has  been  lacking.  Another  interesting  observa¬ 
tion  is  that  while  man  made  objects  have  rich  geomet¬ 
ric  structure,  little  use  of  this  structural  information  was 
made  in  the  older  systems. 

In  [6]  a  region  segmenter  is  used  and  the  relationships 
between  such  objects  as  roads  and  houses  are  used  to  improve
detection.  In  [7]  lines  and  junctions  are  used  as
features.  Height  is  obtained  from  stereo  or  2D  reasoning 
to  generate  a  wire  frame  model  which  is  then  completed 
by  hypothesizing  likely  structures.  Contour  tracing  with 
some  structural  guidance  such  as  oriented  corners  and  depth
from  shadow  has  been  used  in  [26-28].  Fua  and  Hanson
[8]  segment  the  scene  into  regions,  find  edges  lying  on 
region  boundaries  and  then  see  if  there  is  evidence  of  geo¬ 
metric  structure  among  these  edges  to  classify  the  region 
as  a  man-made  object.  In  the  VISIONS  system  [9],  re¬ 
gion  segmentation  is  the  primary  technique  used  and  the 
regions  are  classified  by  their  shape  and  spectral  proper¬ 
ties.  SPAM  [10]  is  a  map  based  system  which  uses  region 
segmentation  of  aerial  imagery.  Recently,  applications  of 
perceptual  grouping  to  locate  features  indicating  structure
have  been  explored  [11].

Most  of  these  systems  work  in  simple  scenes,  for  example
rural  scenes,  where  the  building  roof  can  be  simply
segmented  (and  even  identified)  from  the  surround
on  spectral  properties.  The  buildings  detected  have  simple
shapes  (MOSAIC  [7]  handles  complex  buildings  and
urban  environments,  but  makes  many  errors).  Only  a  few
systems  compute  and  use  depth  information.  None  of
the  systems  generate  a  description  of  the  buildings  at  the 
level  of  shape  descriptions  of  the  different  wings. 

3.2  Problems  Specific  to  this  Domain 

Fig.  1  shows  a  stereo  pair  of  images  of  a  building  with 
wings  of  various  heights  in  a  suburban  environment.  The 
building  is  easy  for  humans  to  see,  even  without  stereo, 
but  it  is  in  fact  very  difficult  for  current  vision  systems. 
Fig.  2  shows  the  line  segments  detected  in  the  image-pair 
(Fig.  1).  We  are  still  able  to  see  the  roof  structures  of  the 
buildings  readily  and  easily,  but  the  complexity  of  the 
task  now  becomes  more  apparent.  In  previous  work,  as¬ 
suming  a  generic  model  of  the  building  shapes  and  other 
knowledge  (such  as  shadows),  it  has  been  possible  to  ex¬ 
tract  the  desired  objects  in  somewhat  simpler  environ¬ 
ments  [26-28].  The  current  scene,  however,  stretches  the 
limits  of  such  “contour-tracing”  like  methods. 

Many  real  images,  of  which  aerial  imagery  of  urban 
scenes  is  an  example,  present  a  multitude  of  problems 
which  are  insurmountable  by  region  and  edge  based  seg¬ 
mentation  systems.  In  the  case  of  buildings,  the  roofs  of 
their  various  wings  are  made  of  the  same  or  similar  material.

Figure  1:  Aerial  image  of  a  suburban  scene.


Therefore,  there  is  often  very  little  contrast  between  ad¬ 
joining  roofs.  Areas  adjoining  roofs  like  curbs,  parking 
lots,  walkways  are  often  made  of  materials  (for  example 
concrete)  similar  to  that  used  for  the  roofs.  There  are 
random  shaped  patches  on  roofs  due  to  dirt  and  varia¬ 
tion  in  material.  The  small  structures  on  the  roofs  and 
objects  such  as  cars  and  trees  adjoining  buildings  confuse
region  finders  and  contour  tracers.  The  shadows  and
surface  markings  on  the  roof  cause  similar  problems. 

There  are  other  characteristics  of  these  images  which 
specifically  cause  problems  for  contour  tracking  type  systems
[26-28].  Roofs  have  raised  borders  which  sometimes
cast  shadows  on  the  roof.  This  results  in  multiple  close 
parallel  edges  along  the  roof  boundaries  and  often  these 
edges  are  broken  and  disjoint.  At  roof  corners  and  at 
junctions  of  two  roofs,  multiple  lines  meet  leading  to  a 
number  of  corners  making  it  difficult  to  choose  a  corner 
for  tracking.  Roofs  cast  shadows  along  their  sides  and 
often  have  objects  on  the  ground  near  them  like  grass 
lots,  trees,  trucks,  pathways  etc.,  which  lead  to  changes 
of  contrast  along  the  roof  sides.  Thus  while  tracking  one 
can  face  reversal  in  edge  direction.  Often  some  structures 
both  on  the  roof  and  on  the  ground  are  so  near  the  roof 
that  the  border  edges  get  merged  with  the  edges  of  these 
objects,  leading  contour  trackers  off  the  roofs  onto  the 
ground  or  inside  the  roof.  At  junctions  it  is  difficult  to 
decide  which  path  to  take.  Searching  all  paths  at  junc¬ 
tions  leads  to  a  combinatorial  explosion  of  paths.  It  may 
be  difficult  to  decide  on  the  correct  contours  since  con¬ 
tours  may  not  close  because  of  missing  edge  information, 
or  more  than  one  closed  contour  may  be  generated.  Contours
may  merge  roofs  or  roofs  and  parts  of  the  ground.
Figures  2,  15  and  21  illustrate  some  of  these  problems. 
Note  the  edges  around  the  roof  of  the  building. 

Figure  2:  Linear  segments  detected  in  Fig.  1

Stereo  analysis  is  also  difficult  here.  The  roof  tops
have  little  texture  and  thus  little  context  is  available  for
the  matching  of  their  boundaries.  In  fact,  the  usual  assumption
that  disparity  changes  smoothly  is  violated  in
very  many  places,  since  the  important  boundaries  largely
represent  depth  discontinuities. 

3.3  Choice  of  Collated  Features 

Most  buildings  such  as  commercial  buildings,  residential 
complexes,  warehouses,  offices,  barracks,  airport  termi¬ 
nals,  railway  and  bus  stations,  and  school  and  college 
buildings  can  be  modeled  as  compositions  of  parallelepiped
shapes.  In  aerial  images,  building  walls  get
occluded  by  roofs  or  other  walls  and  when  visible  appear 
heavily  foreshortened.  Roofs  form  the  only  clearly  visi¬ 
ble  parts  of  the  buildings,  in  such  images.  The  roofs  lie 
parallel  to  the  imaging  plane,  and  the  imaging  distance  is 
large  compared  to  the  size  of  the  roofs,  so  the  view  of  the 
roofs  is  nearly  orthogonal. 

Roof  shapes  can  be  modeled  as  a  combination  of  rectangles.
For  example,  L,  T,  H  and  E  shapes  are  all  combinations
of  rectangles.  The  important  structural  relationships
among  features  of  objects  of  this  visual  domain  are
those  of  linearity,  parallelism,  U-shape-ness  and  rectangularity.
Therefore,  it  is  these  feature  groupings  that  would
be  made  significant  by  perceptual  organization.  The  col¬ 
lated  features,  which  we  use,  are  lines,  parallels,  Us  and 
rectangles.  Note  that  the  collated  feature  named  rect¬ 
angle  is  suitable  for  describing  the  shape  of  the  objects 
detected.  We  have  given  names  such  as  U  and  rectangle 
to  these  collated  features,  but  these  are  just  convenient 
nomenclatures  for  the  structural  property  of  the  collation 
and  do  not  indicate  any  recognition  or  description  of
the  shape  at  this  stage. 

In  the  visual  domain  we  have  chosen,  the  viewpoint 
is  directed  towards  the  ground  plane  and  the  choice  of 
collated  features  reflects  this.  The  problem  that  some  of 
the  structural  relationships  may  change  with  a  change  in 
the  direction  of  the  viewpoint  does  not  arise  since  there  is 
only  one  allowed  viewpoint  direction  for  this  domain.  It  is 
possible  that  some  other  collated  features  may  be  formed 
for  this  domain,  such  as  groups  of  regularly  spaced  paral¬ 
lels  (due  to  roads)  and  local  directional  maps  [12]  (due  to 
textured  areas  such  as  grass).  We  shall  not  consider  such 
collations  here  since  they  are  not  relevant  to  buildings. 

4  Detection  of  Collated  Features 

The  detection  of  collated  features  works  bottom-up  from 
edges,  grouping  them  step  by  step  into  more  and  more  ge¬ 
ometrically  evolved  shapes.  Initially  all  reasonable  group¬ 
ings  are  considered  as  collated  features;  this  set  is  then 
refined  by  identifying  the  significant  collations  as  more 
global  context  from  other  collated  features  becomes  available
after  the  initial  groupings  are  made.

The  collated  features  are  generated  in  a  recursive  man¬ 
ner.  First  simple  collated  features  are  formed  by  grouping 
edges  on  simple  geometric  relationships.  These  collations 
in  turn  serve  as  tokens  for  more  complex  collated  features 
which  represent  some  geometrical  relationship  among  the 
component  structures.  Thus  a  hierarchy  of  collated  fea¬ 
tures  is  created  moving  from  simple  collations  such  as 
lines  up  to  rectangles.  This  hierarchy  represents  an  evolution
of  geometrical  organization;  elements  higher  up  in
the  hierarchy  reflect  more  organization  and  thus  have  a 
higher  probability  of  reflecting  an  actual  similar  organi¬ 
zation  in  the  scene  rather  than  an  accidental  collection 
of  image  elements.  The  simpler  collated  features  such  as 
lines  are  less  structure  specific,  since  they  could  belong  to 
a  variety  of  shapes,  but  as  we  move  up  the  hierarchy,  the 
more  complex  collations  such  as  rectangles  are  more  spe¬ 
cific  because  they  allow  limited  structural  interpretations 
of  the  shape. 
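
The hierarchy described above can be pictured as records linked by part-of relations. The following is an illustrative sketch only: the class name `Collation`, the `LEVELS` table and the `segments` method are assumptions made for this example and are not taken from the system described in this paper.

```python
from dataclasses import dataclass, field

# Hypothetical ordering of the levels, from primitive segments up to rectangles.
LEVELS = {"segment": 0, "line": 1, "parallel": 2, "U": 3, "rectangle": 4}

@dataclass
class Collation:
    kind: str                                   # one of the keys of LEVELS
    parts: list = field(default_factory=list)   # component collations (part-of links)

    def level(self):
        """Position of this collation in the hierarchy."""
        return LEVELS[self.kind]

    def segments(self):
        """All distinct primitive segments reachable through the part-of links."""
        if self.kind == "segment":
            return [self]
        found = []
        for part in self.parts:
            for seg in part.segments():
                if not any(seg is f for f in found):  # compare by identity
                    found.append(seg)
        return found

s1, s2, s3 = Collation("segment"), Collation("segment"), Collation("segment")
line_a = Collation("line", [s1, s2])
line_b = Collation("line", [s3])
parallel = Collation("parallel", [line_a, line_b])
```

In this representation a higher-level collation such as `parallel` accounts for more primitive segments than either of its lines, which mirrors the claim that more organized groupings are less likely to be accidental.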

The  large  collated  features  and  their  subsidiary  col¬ 
lations  exchange  information  in  both  directions.  Sim¬ 
ple  collated  features  are  used  to  form  the  more  complex 
groupings  which  in  turn  help  the  formation  of  the  simpler 


collated  features  by  use  of  their  more  global  view.  This 
symbiotic  relationship  will  become  more  evident  in  the 
following  subsections. 

The  collated  features  provide  multiple  descriptions  of 
the  structure.  These  descriptions  usually  differ  in  the  re¬ 
lationships  described  or  the  scale.  Also  alternative  group¬ 
ings,  at  the  same  descriptive  level,  of  the  features  (col¬ 
lated  or  primitive)  exist.  Usually  a  process  of  selection 
culls  the  alternatives  as  more  context  from  other  collated 
features  becomes  available,  or  as  other  visual  processes 
provide  more  information,  allowing  only  unique  collations
of  the  underlying  visual  tokens  to  survive.  However,  in
cases  where  the  visual  information  is  not  sufficient  to
clearly  identify  the  preferred  collated  features,  alternative
groupings  may  survive  (as  can  be  seen  in  some  Moiré
patterns).

We  have  discussed  previously  the  choice  of  lines,  par¬ 
allels,  Us  and  rectangles  as  the  set  of  collated  features. 
We  now  describe  the  process  of  their  detection  in  detail. 
The  input  set  of  primitive  image  features  consists  of  lin¬ 
ear  segments  and  corners.  We  use  the  “USC  LINEAR” 
system,  based  on  the  “Nevatia-Babu”  line  finder  [13]  to 
detect  linear  segments  in  the  scene.  Two  types  of  corners 
are  detected,  L  and  T  junctions.  We  currently  do  not 
investigate  orthogonal  trihedral  vertices  (OTVs)  as  few 
walls  are  visible,  and  those  that  are,  appear  highly  fore¬ 
shortened  and  have  shadows  etc.  near  them  making  the 
OTVs  difficult  to  detect  accurately.  T  junctions  for  urban 
aerial  imagery  do  not  have,  in  general,  the  usual  interpre¬ 
tations  of  occlusion.  The  buildings  have  wings  which  are 
aligned  and  nearby  objects  like  roadways,  etc.  are  also 
aligned  to  the  building  sides.  Thus  in  the  top  view,  the 
sides  of  two  different  structures  can  create  T  junctions  in 
which  the  top  line  belongs  to  two  different  objects  and  is 
not  occluding  the  stem. 

4.1  Lines:  Linear  Structures 

Due  to  poor  contrast,  many  linear  segments  along  the  roof 
borders  get  fragmented.  Near  junctions  or  in  the  close  presence
of  other  strong  features,  these  fragments  get  displaced 
from  the  straight  line  the  border  lies  on,  making  simple 
collinearization  useless. 

A  collection  of  parallel  lines  bunched  along  the  same 
linear  axis  represents  the  presence  of  a  linear  structure 
at  a  higher  granularity  level  than  the  edges  (for  example 
the  boundary  of  the  roof  as  opposed  to  the  individual 
lines  belonging  to  its  borders).  We  wish  to  group  closely 
bunched  parallel  linear  segments  since  they  represent  a 
linear  structure  of  some  object,  like  the  border  of  a  roof 
or  the  divider  on  a  road. 

To  detect  such  groupings  of  edges,  we  “fold”  the  space 
around  each  segment  onto  the  segment  repeatedly,  like 
pleats  in  an  accordion,  collecting  the  segments  from  this 


Figure  3:  Linear  structures  obtained  from  Fig.  2 


space  which  lie  parallel  to  it.  This  folding  process  is 
halted  as  soon  as  no  new  segments  are  located  or  when 
the  threshold  on  the  spread  about  the  linear  structure  is 
exceeded.  Fig.  3  shows  the  lines  obtained  from  Fig.  2. 
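
The folding process can be sketched as a band around a seed segment that is widened step by step. This is a minimal sketch under stated assumptions: the function names, the angular tolerance `angle_tol`, the band increment `step` and the limit `max_spread` are all hypothetical, since the thresholds actually used are not given here, and proximity along the axis of the seed is ignored for brevity.

```python
import math

def angle(seg):
    """Orientation of a segment in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def angle_diff(a, b):
    """Smallest difference between two undirected orientations."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)

def lateral_offset(seed, seg):
    """Perpendicular distance from the midpoint of seg to the line through seed."""
    (x1, y1), (x2, y2) = seed
    mx = (seg[0][0] + seg[1][0]) / 2
    my = (seg[0][1] + seg[1][1]) / 2
    length = math.hypot(x2 - x1, y2 - y1)
    return abs((x2 - x1) * (my - y1) - (y2 - y1) * (mx - x1)) / length

def fold_group(seed, segments, step=2.0, max_spread=6.0,
               angle_tol=math.radians(5)):
    """Collect near-parallel segments around seed, one fold (band) at a time."""
    group = [seed]
    band = step
    while band <= max_spread:              # stop when the spread limit is exceeded
        new = [s for s in segments
               if s not in group
               and angle_diff(angle(seed), angle(s)) <= angle_tol
               and lateral_offset(seed, s) <= band]
        if not new:                        # stop when a fold finds nothing new
            break
        group.extend(new)
        band += step
    return group

seed = ((0, 0), (10, 0))
candidates = [((11, 1), (20, 1.2)),   # slightly displaced fragment
              ((0, 3), (10, 3)),      # close parallel edge, found on the second fold
              ((0, 20), (10, 20)),    # parallel but far away
              ((5, 0), (5, 10))]      # perpendicular
linear_structure = fold_group(seed, candidates)
```

With these example values the group contains the seed and the first two candidates; the distant parallel and the perpendicular segment are left out.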


Figure  4:  Parallels 

4.2  Parallels 

For  each  line  obtained  above,  we  find  lines  that  are  paral¬ 
lel  to  it  and  have  a  sufficient  overlap  with  it.  Man-made 
structures  in  urban  scenes  like  building-wings,  roads  and 
parking  lots  are  organized  in  regular  grid-like  patterns. 
These  structures  are  all  composed  of  parallel  sides.  As  a 
consequence,  for  each  significant  line-structure  detected 
in  the  scene,  there  is  not  one  but  numerous  lines  parallel 
to  it. 
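
The test for a parallel can be illustrated as follows. This is a sketch only: the helper names and the thresholds `angle_tol` and `min_overlap` are assumptions, and the measure of "sufficient overlap" (the fraction of the shorter line covered by the other when both are projected onto a common axis) is one plausible choice rather than the definition used in the system.

```python
import math

def direction(seg):
    """Unit direction vector of a line given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    n = math.hypot(x2 - x1, y2 - y1)
    return ((x2 - x1) / n, (y2 - y1) / n)

def overlap_fraction(a, b):
    """Fraction of the shorter line covered when both are projected on a's axis."""
    ux, uy = direction(a)
    proj = lambda p: (p[0] - a[0][0]) * ux + (p[1] - a[0][1]) * uy
    a_lo, a_hi = sorted((proj(a[0]), proj(a[1])))
    b_lo, b_hi = sorted((proj(b[0]), proj(b[1])))
    shared = min(a_hi, b_hi) - max(a_lo, b_lo)
    shorter = min(a_hi - a_lo, b_hi - b_lo)
    return max(0.0, shared) / shorter

def is_parallel_pair(a, b, angle_tol=math.radians(5), min_overlap=0.5):
    """True if a and b are nearly parallel and overlap sufficiently."""
    ua, ub = direction(a), direction(b)
    cross = abs(ua[0] * ub[1] - ua[1] * ub[0])   # |sin| of the angle between them
    return cross <= math.sin(angle_tol) and overlap_fraction(a, b) >= min_overlap

roof_side = ((0, 0), (10, 0))
opposite_side = ((2, 5), (12, 5))
far_line = ((20, 5), (30, 5))
```

Here `roof_side` and `opposite_side` form a parallel, while `far_line` is parallel in orientation but has no overlap with `roof_side`. Because every significant line tends to have many partners in a grid-like scene, such a test is expected to return numerous candidates that later stages must choose among.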

The  formation  of  the  parallel  collated  features  in  turn 
aids  the  formation  of  lines.  The  structure  of  two  parallel
lines  strongly  suggests  a  complete  overlap  between
the  two.  If  the  original  lines  do  not  overlap  completely,
their  extensions,  which  will  complete  the  overlap,  are  considered.
These  extensions  are  alternate  line  groupings  of
the  underlying  linear  segments.  Often  these  new  line  col¬ 
lations  will  include  new  linear  segments  in  the  extension, 
which  had  not  previously  been  included  in  the  same  lin¬ 
ear  structure  as  the  information  (in  terms  of  proximity 
and  collinearity)  in  the  context  of  the  lines  alone  was  not 
sufficient  to  trigger  the  grouping. 
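
The extension that completes the overlap can be sketched geometrically. The function `complete_overlap` below is a hypothetical helper written for this illustration; it assumes the two lines are already parallel and only computes the extended endpoints. The regrouping of the underlying linear segments along the extension is not shown.

```python
import math

def complete_overlap(a, b):
    """Extend two parallel lines so that both span the same interval along a's axis."""
    (x1, y1), (x2, y2) = a
    n = math.hypot(x2 - x1, y2 - y1)
    ux, uy = (x2 - x1) / n, (y2 - y1) / n
    t = lambda p: (p[0] - x1) * ux + (p[1] - y1) * uy   # position along the axis
    lo = min(t(p) for p in a + b)
    hi = max(t(p) for p in a + b)

    def stretch(seg):
        # Slide each endpoint along the axis, keeping the lateral offset of seg.
        t0 = t(seg[0])
        x0, y0 = seg[0]
        return ((x0 + (lo - t0) * ux, y0 + (lo - t0) * uy),
                (x0 + (hi - t0) * ux, y0 + (hi - t0) * uy))

    return stretch(a), stretch(b)

ext_a, ext_b = complete_overlap(((0, 0), (10, 0)), ((2, 5), (12, 5)))
```

In this example both lines are extended to span the interval from 0 to 12 along the x axis. Each extended line is only a hypothesis: it becomes an alternate line collation that competes with the original grouping.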


Figure  5:  U  structures 

4.3  U  Structures 

Each  pair  of  parallel  lines  evolves  into  a  set  of  parallels 
and  the  ends  of  their  component  lines  are  aligned.  A  set 
of  parallel  lines  with  aligned  ends  is  a  strong  indication 
that  there  is  possibly  a  line  joining  those  ends  to  create  a 
U  shaped  structure. 

Thus  the  presence  of  a  parallel  with  aligned  ends  trig¬ 
gers  the  formation  of  another  collated  feature,  the  U  struc¬ 
ture.  The  U  collation,  or  rather  the  parallel  with  aligned 
ends,  gives  strong  suggestion  of  a  line  joining  the  two  ends. 
If  an  appropriate  line  joining  the  aligned  ends  does  not 
already  exist,  a  new  linear  collation  is  created.  This  new 
collation  may  incorporate  any  existing  linear  segments  or 
may  be  “virtual”;  again  the  perception  of  a  complex  struc¬ 
ture,  in  this  case  a  U,  triggers  the  formation  of  a  less 
evolved  collated  structure,  the  line. 
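
The closing of a parallel into a U can be sketched as a search for an existing line near the aligned ends, with a fallback to a virtual line. The function `make_u`, the distance tolerance `tol` and the dictionary layout are assumptions made for this example only.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def make_u(a, b, lines, end=1, tol=2.0):
    """Form a U from parallel lines a and b, closed at endpoint index `end`.

    If some existing line joins the two aligned ends (within tol), it becomes
    the base; otherwise a "virtual" base is created between the ends.
    """
    p, q = a[end], b[end]
    for line in lines:
        joins = (dist(line[0], p) <= tol and dist(line[1], q) <= tol) or \
                (dist(line[0], q) <= tol and dist(line[1], p) <= tol)
        if joins:
            return {"sides": (a, b), "base": line, "virtual": False}
    return {"sides": (a, b), "base": (p, q), "virtual": True}

side_a = ((0, 0), (10, 0))
side_b = ((0, 5), (10, 5))
existing_lines = [((10, 0.5), (10, 4.8))]      # a detected line near the right ends

right_u = make_u(side_a, side_b, existing_lines, end=1)
left_u = make_u(side_a, side_b, existing_lines, end=0)
```

In this example the right-hand U reuses the detected line as its base, while the left-hand U has no supporting line and receives a virtual base joining the two ends.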


Figure  6:  Rectangles 


4.4  Rectangles 

Each  parallel  generates  two  U  structures  and  the  Us  of  a 
parallel  taken  together  form  a  rectangle.  Us  are  not  the 
only  components  of  the  rectangle;  its  percept  depends  on
each  of  its  components,  the  lines,  the  parallels  and  the  Us. 
While  we  have  chosen  to  represent  the  rectangle  as  one 
collated  feature,  a  structure  of  similar  complexity  may  be
possibly  represented  by  various  collations,  each  making
salient  a  different  structural  property  such  as  the  various 
symmetries.  For  us  a  single  representation  is  sufficient 
because  of  the  simplicity  of  the  shapes  in  our  domain  and 
the  fact  that  we  shall  not  make  use  of  the  symmetries  for 
any  descriptive  (or  other  visual)  purposes. 
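
A rectangle hypothesis can be illustrated by closing both ends of a parallel and checking that the corners are close to right angles. This is a simplified sketch: the system described here forms the rectangle from the two Us of a parallel, whereas the hypothetical function below takes the corners directly from the two sides, assumes both sides are listed in the same direction, and uses an assumed tolerance `max_skew_deg`.

```python
import math

def rectangle_from_parallel(a, b, max_skew_deg=10.0):
    """Return the four corners if the closed parallel is roughly rectangular, else None."""
    corners = [a[0], a[1], b[1], b[0]]   # along a, across, back along b
    for i in range(4):
        p, q, r = corners[i - 1], corners[i], corners[(i + 1) % 4]
        v1 = (p[0] - q[0], p[1] - q[1])
        v2 = (r[0] - q[0], r[1] - q[1])
        cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        corner_angle = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
        if abs(corner_angle - 90.0) > max_skew_deg:
            return None
    return corners

wing = rectangle_from_parallel(((0, 0), (10, 0)), ((0, 5), (10, 5)))
skewed = rectangle_from_parallel(((0, 0), (10, 0)), ((6, 5), (16, 5)))
```

The first parallel has aligned ends and yields four corners; the second has ends that are offset along the axis, so the corner angles deviate too far from 90 degrees and no rectangle is formed.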

5  Selection  of  Collated  Features 

The  detection  of  collated  features,  where  all  reasonable 
groupings  among  tokens  resulted  in  the  formation  of  col¬ 
lated  features,  is  followed  by  selection  where  only  the 
more  suitable  collated  features  are  retained.  The  detec¬ 
tion  and  selection  processes  could  proceed  simultaneously. 

At  each  level  of  collated  features  various  collations  are 
in  contention  as  they  provide  alternative  groupings  of  the 
underlying  tokens.  Also  some  collations  may  have  been 
formed  on  weak  evidence,  evidence  that  seems  too  weak 
when  compared  to  that  for  other  collated  features  at  that 
level.  A  selection  process  has  to  choose  “good”  collations, 
i.e.  those  which  have  high  probability  of  corresponding  to 
individual  object  parts. 

The  “goodness”  of  a  collated  feature  depends  on  how 
it  compares  to  its  alternatives  in  terms  of  the  support  it 
has  from  related  collations  at  other  levels,  and  the  sup¬ 
port  or  contradiction  from  its  component  primitive  fea¬ 
tures  and  other  related  image  features.  A  collated  feature 
is  not  supported  just  by  its  component  collations  but  also 
by  the  collations  it  itself  is  a  component  of.  The  latter
relationship  is  due  to  the  fact  that  the  percept  of  a  larger 
structure  strengthens  that  of  a  smaller  component  struc¬ 
ture,  for  example  as  the  percept  of  a  U  shape  strengthens 
the  percept  of  a  line  forming  the  base  of  the  U.  Thus  a 
line  and  the  U  it  belongs  to  are  mutually  supportive.  In 
general  terms,  collated  features  which  are  linked  by  part-of
relationships  are  mutually  supportive  and  those  that
share  component  collations  are  mutually  competitive. 
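
The rule for support and competition can be written down directly from the part-of structure. In the sketch below the function `build_links`, the dictionary layout and the link values `+1.0` and `-1.0` are illustrative assumptions; the actual strengths used by the system are not specified at this point.

```python
from itertools import combinations

def build_links(components, support=1.0, conflict=-1.0):
    """components maps each collation name to the set of collations it is built from.

    Part-of pairs receive a supporting link; pairs that share a component
    (and are not themselves in a part-of relation) receive a competing link.
    """
    weights = {}
    for whole, parts in components.items():
        for part in parts:
            weights[frozenset((whole, part))] = support
    for x, y in combinations(components, 2):
        key = frozenset((x, y))
        if key not in weights and components[x] & components[y]:
            weights[key] = conflict
    return weights

components = {
    "line1": set(), "line2": set(), "line3": set(),
    "U1": {"line1", "line2"},
    "U2": {"line2", "line3"},   # U1 and U2 both claim line2
}
links = build_links(components)
```

Here each U is mutually supportive with its own lines, and the two Us compete because they share `line2`.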

5.1  Constraint  Satisfaction  Networks 

The  collated  features  and  the  relationships  of  support  and 
conflict  among  them  naturally  define  a  network  with  the 
collations  serving  as  nodes  and  the  relationships  as  arcs. 
Our  goal  is  to  find  the  optimal  feature  groupings  consistent
with  the  known  optical  and  geometrical  constraints
[14,15].  Note  that  all  the  constraints  must  be  simultaneously
satisfied  to  reach  global  consistency  across  all  levels
of  the  hierarchy. 

One  parallel  technique  to  solve  this  problem  is  relax¬ 
ation  where  a  cost  function  associated  with  the  network  is 
minimized.  We  wish  to  select  the  best  consistent  feature 
groupings,  and  reject  the  bad  groupings.  If  we  formulate 
the  cost  function  such  that  the  optimal  solution  corresponds
to  its  global  minimum,  then  the  problem  of  locating
the  best  groupings  reduces  to  that  of  minimizing  the  cost
of  the  network  given  the  constraints  (defined  by  the  rela¬ 
tions)  between  the  collated  features  and  the  observed  im¬ 
age  characteristics.  Parallel  optimization  techniques  such 
as  simulated  annealing  [16],  Hopfield  networks  [17,18], 
Boltzman  machines  [19,15,20],  probabilistic  solutions  [21] 
and  connectionist  methods  [22,19]  have  been  proposed  for 
problems  which  can  be  formulated  as  Constraint  Satisfac¬ 
tion  Networks. 

We  use  a  slightly  modified  version  of  Hopfield  net¬ 
works  to  implement  the  constraint  satisfaction  network. 

5.2  Hopfield  Networks 

Following  the  notation  convention  of  Hopfield  and  Tank 
[23,24]  we  describe  the  behavior  of  each  node  in  the  net¬ 
work  by: 


$$\frac{du_i}{dt} = -u_i + \sum_{j=1}^{N} T_{ij} V_j + I_i - h_i \qquad (1)$$

$$V_j = g(u_j) \qquad (2)$$

$$h_i = \text{resting potential of node } i \qquad (3)$$

where $T_{ij}$ is the weight on the link from node $j$ to node $i$,
$u_i$ is the total input to node $i$ and $V_i$ is the output of node
$i$. The addition of $h_i$, the resting potential, is useful in
adjusting the sensitivity of a neuron by shifting its gain
curve. For purposes of analysis of the network, the resting
potential may be combined with the input.

When the network has symmetric connections, i.e.
$T_{ij} = T_{ji}$, the network, where each element has the above
equation of motion, converges to stable states. This property
has been shown for more general networks by Hummel
and Zucker [25]. Also, when the gain function $g$ has
high gain (the gain curve is narrow), the stable
states of the N elements are the local minima of the following
cost function, with the outputs of the nodes at 0
or 1 [24]:

$$E = -\frac{1}{2} \sum_{i} \sum_{j} T_{ij} V_i V_j - \sum_{i} V_i I_i \qquad (4)$$

The signs in the above cost function suggest that if
we wish to select mutually supporting collations and reject
mutually conflicting collated features, the weights $T_{ij}$
between supporting hypotheses should be positive and
those between conflicting hypotheses should be negative.
Those optical and geometrical constraints which are not
expressed purely via the interrelationships between the
interpretations should be fed as inputs $I_i$ to the nodes.
Again, the signs in equations (1) and (4) show that supporting
evidence should be included as positive input and
contradicting evidence as negative input.

In most applications of Hopfield networks, the problem
is explicitly stated as the minimization of a particular
energy  or  cost  function.  If  the  problem  is  naturally  formu¬ 
lated  in  terms  of  a  cost  or  energy  function,  for  example 
the  cost  of  traversing  a  path  in  the  traveling  salesman 
problem  [17]  or  the  energy  of  a  membrane  or  thin  plate 
in  surface  interpolation  [26],  the  network  is  constructed 
by transforming this cost function to the form of equation (4) and
selecting the interconnections $T_{ij}$ from this transformation.
In  other  words,  when  there  is  an  energy  function  associ¬ 
ated  with  the  problem  it  can  be  used  to  specify  the  in¬ 
terconnections.  If  the  problem  is  more  suitably  described 
as  a  constraint  satisfaction  problem,  as  is  our  case,  with 
the  relationships  among  the  elements  specifying  the  con¬ 
straints,  it  is  more  natural  to  specify  the  interconnections 
from  these  relationships  rather  than  to  formulate  an  en¬ 
ergy  function  for  the  network. 
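
As a minimal sketch of the node dynamics in equations (1) and (2) and of the cost function (4), the following Python fragment integrates the equation of motion with a simple Euler step. The logistic form of the gain function, the step size, the number of steps and the three-node example weights are assumptions chosen for illustration; they are not taken from the implementation described in this paper.

```python
import math

def gain(u, steepness=5.0):
    """Logistic gain g(u); a larger steepness gives a narrower gain curve (assumed form)."""
    return 1.0 / (1.0 + math.exp(-steepness * u))

def relax(T, I, h, steps=200, dt=0.1):
    """Euler-integrate du_i/dt = -u_i + sum_j T[i][j] V_j + I_i - h_i, with V_j = g(u_j)."""
    n = len(I)
    u = [0.0] * n
    for _ in range(steps):
        V = [gain(x) for x in u]
        du = [-u[i] + sum(T[i][j] * V[j] for j in range(n)) + I[i] - h[i]
              for i in range(n)]
        u = [u[i] + dt * du[i] for i in range(n)]
    return [gain(x) for x in u]

def energy(T, I, V):
    """Cost function (4). Pass I_i - h_i as I to fold the resting potential into the input."""
    n = len(V)
    pair = sum(T[i][j] * V[i] * V[j] for i in range(n) for j in range(n))
    return -0.5 * pair - sum(V[i] * I[i] for i in range(n))

# Hypothetical example: nodes 0 and 1 support each other, node 2 conflicts with both.
T = [[0.0, 0.8, -0.8],
     [0.8, 0.0, -0.8],
     [-0.8, -0.8, 0.0]]
I = [0.3, 0.3, 0.2]
h = [0.2, 0.2, 0.2]
V = relax(T, I, h)
```

With these example values the two mutually supporting nodes settle near 1 and the conflicting node near 0, and the binary state that selects the supporting pair has the lower value of the cost function (4).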

5.3  Selection  of  Constraints 

To ensure the selection of perceptually significant feature
groupings  in  the  scene,  the  choice  of  weights  should  reflect 
the  perceptual  importance  placed  on  the  optical  and  geo¬ 
metric  constraints  between  the  various  collated  features. 
The  perceptual  significance  of  a  collated  feature  lies  in 
its  indication  of  actual  object  structure  in  the  scene.  For 
example,  while  any  grouping  of  parallel  lines  [11]  is  indica¬ 
tive  of  some  order  in  the  scene,  we  are  more  interested  in 
parallels  that  actually  correspond  to  individual  objects. 
Therefore,  the  parallels  that  have  supporting  structural 
evidence  such  as  rectangles  are  more  significant  than  those 
that  do  not. 

The  relationships  between  the  various  collated  fea¬ 
tures  are  represented  by  the  network  in  Fig.  7.  Collated 
features  which  support  each  other  are  connected  via  posi¬ 
tively  weighted  links  (bold  lines)  while  mutually  conflict¬ 
ing  collations  are  linked  via  negatively  weighted  links  (thin 
lines). 

As Fig. 7 shows, collated features at different levels of
the hierarchy which share image features support each
other, while groupings at the same level which share image
features conflict with each other. The details of the
various nodes and their connections are as follows:

•  Lines:  Lines  have  two  components,  edges  and  gaps. 
Edges  refer  to  the  intensity  discontinuities  detected 
in  the  image,  gaps  are  sections  of  the  line  where  no 
edges  were  detected.  Presence  of  edges  supports  the 
line  hypothesis  while  the  presence  of  gaps  weakens 
it.  Lines  which  share  edges  are  in  conflict.  Presence 
of  corners  at  the  ends  of  the  line  strengthens  the 
percept  of  the  line. 

•  Gaps: Lines aligned with a gap support the line to
which the gap belongs, i.e. they weaken the gap hypothesis.
On the other hand, lines crossing
the gap strengthen the gap hypothesis, i.e. contradict
the line percept.

Figure 7: Constraint Satisfaction Network

•  Parallels:  Parallels  are  supported  by  their  compo¬ 
nent  lines  and  by  the  Us  and  rectangles  they  are  a 
part  of.  The  various  parallels  formed  from  the  same 
line  are  in  competition. 

•  Us:  Us  are  supported  by  the  parallels,  the  line  at 
the  base  of  the  U  and  the  corners  at  the  junctions  of 
the  base  and  the  parallels.  Us  of  conflicting  parallels 
are  also  mutually  conflicting.  Lines  crossing  across 
the  base  of  the  U  form  evidence  against  it. 

•  Rectangles:  The  components  of  a  rectangle  and 
the  four  possible  corners  are  supportive  evidence 
for  it.  Different  rectangles  which  share  the  same 
edges  are  in  conflict,  and  so  are  rectangles  which 
overlap  each  other  (i.e.  share  areas  of  the  image). 
Surface  markings  indicating  texture  elements  inside 
the  rectangle  further  inhibit  the  rectangle  hypothe¬ 
sis. 

Each  collated  feature  supports  itself  (self  excitation). 
In  addition  to  the  relationships  described  above,  there  is 
global  competition  among  collated  features  of  the  same 
complexity. This is to ensure that isolated but poor grouping
hypotheses (i.e. groupings which have no alternative
groupings for their component features) do not get selected
just due to the absence of competition.
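
The way such relationships translate into link weights can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_weights`, the self-excitation value, the global-competition value and the five-node example are hypothetical and are not the values used in our system.

```python
def build_weights(levels, support, conflict, self_w=0.2, global_w=-0.05):
    """Return T, where T[i][j] is the weight on the link from node j to node i.

    levels   -- levels[i] is the complexity level of node i (0 = line, 1 = parallel, ...)
    support  -- iterable of (src, dst, w), w in (0, 1]: src is supporting evidence for dst
    conflict -- iterable of (src, dst, w), w in (0, 1]: stored as the negative weight -w
    self_w, global_w -- assumed magnitudes for self excitation and global competition
    """
    n = len(levels)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        T[i][i] = self_w                      # each collated feature supports itself
        for j in range(n):
            if i != j and levels[i] == levels[j]:
                T[i][j] = global_w            # weak competition within one level
    for src, dst, w in support:
        T[dst][src] = w                       # explicit relations override the default
    for src, dst, w in conflict:
        T[dst][src] = -w
    return T

# Hypothetical nodes: 0, 1, 2 are lines a, b, c; 3 is parallel(a, b); 4 is parallel(a, c).
levels = [0, 0, 0, 1, 1]
support = [(0, 3, 0.6), (1, 3, 0.6), (0, 4, 0.6), (2, 4, 0.6),   # lines -> parallels
           (3, 0, 0.3), (3, 1, 0.3), (4, 0, 0.3), (4, 2, 0.3)]   # parallels -> lines
conflict = [(3, 4, 0.7), (4, 3, 0.7)]   # the two parallels share line a
T = build_weights(levels, support, conflict)
```

Note that the support weights in the two directions of a part-of link need not be equal, so the resulting matrix is in general not symmetric.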

The weights on the links range from -1.0 to 1.0 and
are in proportion to the importance of their source collation
as supporting or conflicting evidence to their destination
collation. The selection of the weights has been
random within the confines of the above constraints. We
have found that the network is not sensitive to even large
changes in the weights if the total amount of evidence
($\sum_{j=1}^{N} T_{ij} V_j + I_i$) arriving at the nodes does not change
drastically. The network, in its present form, does not
have  the  ability  to  “self  calibrate”  i.e.  to  automatically 
adjust  the  sensitivity  of  a  node  on  the  basis  of  the  to¬ 
tal  amount  of  information  arriving  at  it.  The  sensitivity 
of  the  nodes  can  be  controlled  by  using  a  bias  similar  to 
the  resting  potential  of  neurons.  By  controlling  the  bias 
we  control  the  amount  of  positive  evidence  required  by  a 
node  to  fire  it. 

While  feature  groupings  at  all  levels  of  complexity  get 
selected  simultaneously,  only  the  selected  rectangles  have 
been  displayed  in  Fig.  8.  In  our  implementation,  the 
weights on the links are not symmetric, so the convergence
results for Hopfield networks cannot be used. However,
there is evidence that such networks can converge under
non-symmetric weights [27]. We have found our networks
to converge within ten iterations for all our selections
of weights.

It  may  be  useful  to  note  here  that  unlike  many  other 
applications  of  “neural-networks”  to  computer  vision  [19], 
each  node  represents  a  high  level  feature.  The  features  to 
be  represented  and  the  relationships  between  them  have 
been  selected  by  us.  This,  however,  does  not  require  the 
network  to  be  made  by  hand;  the  nodes  and  their  rela¬ 
tionships  get  automatically  formed  as  a  byproduct  of  the 
detection  process. 

6  Applications  to  Stereo 

There  is  a  strong  belief  in  the  computer  vision  commu¬ 
nity  that  stereo  correspondence  precedes  any  analysis  of 
the  structures  in  the  image.  This  belief  started  with  the 
experiments  by  Julesz  [4]  and  has  been  demonstrated  in 
the  design  of  stereo  algorithms  using  edges  or  intensity 
as  matching  primitives.  Julesz’s  experiment  just  showed 
that  stereo  correspondence  can  use  primitive  image  ele¬ 
ments  before  any  grouping  is  performed  on  them;  it  does 
not restrict other, more structurally evolved, representations
from being used by the stereoptic process. In
fact,  the  difficulty  encountered  by  humans  in  fusing  ran¬ 
dom  dot  stereograms  indicates  that  the  stereo  process  is 
working  with  a  smaller  class  of  representations  than  it 
usually  does.  More  recently,  stereo  systems  have  shown 
improved  performance  by  using  more  structure  (than  in¬ 
dividual  edges)  [28,29,30,31,32]. 

Collated  features  are  rich  representations.  They  en¬ 
code  particular  structural  relationships  at  a  particular 
scale of description. Matching collated features thus involves
less ambiguity than matching edges, as there are fewer possible
alternatives and more information with which to judge a match.
Also, there are usually far fewer collated structures than edges
at any given representation level. The most probable
role of collated features, and the one that we employ
here, is that correspondence of collated features provides
a rough correspondence for their component primitive features,
which can then be matched with less ambiguity.


For  the  vision  system  CANC  we  use  the  rectangle  col¬ 
lated  features  to  aid  stereo  matching.  For  this  visual  do¬ 
main,  edge  and  segment  based  stereo  matching  algorithms 
displayed  poor  performances.  The  following  factors  indi¬ 
cate  why  stereo  systems  based  on  simple  image  features 
may  not  perform  well  in  this  domain: 


1.  Organized nature of the scene. There are numerous
parallel  lines  since  the  building-wings,  roads,  park¬ 
ing  lots,  etc.  are  all  parallel.  Roof  borders  and 
their  shadows  and  road  markings  also  give  rise  to 
close  parallel  lines.  This  leads  to  many  ambiguous 
matches  and  it  is  difficult  to  resolve  among  the  com¬ 
peting  matches. 

2.  Absence of texture. The building sides represent
areas of high disparity change and there are insufficient
markings on the roofs to support matches with
disparities at roof level, while matches giving low
disparities get favored due to the preponderance of
features on the ground.

The  choice  of  rectangles  restricts  the  possible  matches. 
These  can  be  further  constrained  by  observing  that  roofs 
are  parallel  to  the  imaging  plane.  This  implies  that  the 
disparity  along  the  side  of  a  rectangle,  corresponding  to 
a  segment  of  a  roof,  remains  constant.  Due  to  occlusion, 
however,  the  disparities  of  the  different  sides  of  a  rectangle 
can  all  be  different  with  the  disparity  of  the  side  corre¬ 
sponding  to  the  disparity  of  the  object  occluding  the  rect¬ 
angle.  Since  the  area  bounded  by  a  rectangle  is  assumed 
to  correspond  to  a  horizontal  roof,  the  rectangle  can  be 
assigned  the  minimum  disparity  of  its  sides,  assuming  the 
sides  with  greater  heights  to  belong  to  occluding  objects. 
In  the  case  of  a  hole  in  the  roof  or  in  extreme  cases  of 
occlusion,  the  disparity  of  the  area  bounded  by  the  rect¬ 
angle  may  be  different  from  that  of  any  of  its  sides.  For 
proper  handling  of  such  cases  the  disparity  of  the  area  in¬ 
side  the  rectangle  should  be  obtained,  but  obtaining  reli¬ 
able  matches  with  the  small  amount  of  texture  commonly 
found  on  roofs  leads  us  into  issues  in  stereo  matching  that 
we  do  not  wish  to  address  here. 
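
The disparity assignment described above can be sketched in a few lines of Python. The function names and the tolerance used to flag sides as belonging to an occluding object are assumptions made for illustration only.

```python
def roof_disparity(side_disparities):
    """Disparity assigned to a rectangle: the minimum over its matched sides."""
    if not side_disparities:
        raise ValueError("rectangle has no matched sides")
    return min(side_disparities)

def occluded_sides(side_disparities, tolerance=0.5):
    """Indices of sides whose disparity exceeds the roof disparity by more than
    `tolerance` (an assumed value); these are attributed to occluding objects."""
    base = roof_disparity(side_disparities)
    return [k for k, d in enumerate(side_disparities) if d - base > tolerance]

# Hypothetical side disparities (in pixels) for one matched rectangle.
sides = [4.0, 4.1, 7.5, 4.0]
roof = roof_disparity(sides)
hidden = occluded_sides(sides)
```

This sketch does not handle the hole-in-roof and extreme-occlusion cases mentioned above, where the disparity of the enclosed area differs from that of every side.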

Finding  a  match  between  two  rectangles  corresponds 
to  assigning  a  unique  one-to-one  correspondence  between 
the  sides  of  the  rectangles.  Thus  to  check  if  two  rectan¬ 
gles  match  we  have  to  check  if  sides  of  the  two  match  in 
order; i.e. if side ab of rectangle A matches side cd of
rectangle B then all the sides of A on the left (or right)
of  ab  must  match  all  the  sides  of  B  on  the  left  (or  right) 
of  cd.  As  mentioned  earlier,  the  sides  of  the  rectangles 
of  interest  are  parallel  to  the  imaging  plane.  Therefore, 
two  sides  are  possible  matches  if  they  are  nearly  parallel 
and  lie  within  the  epipolar  window  of  each  other  (i.e.  if 
ab matches cd then at least c or d should lie within the
interval defined by the epipolar lines of a and b). The
presence of L junctions provides an added constraint that
the  two  sides  joined  by  the  junction  have  the  same  dis¬ 
parity,  whether  the  sides  belong  to  the  area  bounded  by 
the  rectangle  or  to  an  occluding  object  (which  we  assume 
for  roofs  can  only  be  another  roof).  If  the  sides  of  two 
rectangles  match  in  order  meeting  the  above  constraint 
then  we  hypothesize  a  possible  match  between  the  two. 
Like  other  stereo  matching  systems  we  allow  only  matches 
falling  within  a  disparity  range  reasonable  for  the  stereo 
pair.  To  avoid  mistaking  rectangles  corresponding  to  ten¬ 
nis  courts,  parking  lots  and  the  like,  the  legal  disparity 
range  should  start  just  above  ground  level.  The  other  end 
of  the  interval  should  be  high  enough  to  encompass  the 
tallest  buildings  in  the  scene.  This  estimate  need  not  be 
exact, as possible wrong matches between rectangles result
in disproportionate disparities. For our test cases we
chose an ad hoc value which was more than twice the
disparity  of  the  tallest  building  in  any  of  the  test  scenes. 

The  key  problem  with  general  stereo  systems  is  the 
ambiguity  in  matching  necessitating  a  mechanism  to  choose 
one  among  many  competing  matches  for  each  match  prim¬ 
itive.  For  this  system  we  have  found  the  constraints  im¬ 
posed  by  the  structure  of  the  collated  feature  sufficient  to 
select  unique  matches  for  the  primitives  (rectangles).  In 
the  rare  case  of  a  rectangle  finding  more  than  one  match, 
we  choose  the  match  with  the  least  disparity  difference 
between  the  sides,  which  is  equivalent  to  preferring  the 
least  occluded  interpretation. 
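
The matching steps of this section can be sketched as follows, assuming that the sides of each rectangle are indexed 0 to 3 in the same rotational sense and that a boolean table of pairwise side compatibility (nearly parallel and within the epipolar window) has already been computed. All names are hypothetical, and applying the legal disparity range to the minimum side disparity is an assumption of this sketch.

```python
def ordered_offsets(compatible):
    """Cyclic offsets k such that side i of A is compatible with side (i + k) mod n
    of B for every i, i.e. the sides match in order."""
    n = len(compatible)
    return [k for k in range(n)
            if all(compatible[i][(i + k) % n] for i in range(n))]

def disparity_spread(side_disparities):
    """Largest disparity difference between the sides of a matched rectangle."""
    return max(side_disparities) - min(side_disparities)

def pick_match(candidates, d_min, d_max):
    """Among candidate matches (name -> side disparities), keep those whose
    minimum side disparity lies in [d_min, d_max] and return the one with the
    least disparity difference between sides, or None if none is legal."""
    legal = {name: d for name, d in candidates.items()
             if d_min <= min(d) <= d_max}
    if not legal:
        return None
    return min(legal, key=lambda name: disparity_spread(legal[name]))

# Hypothetical compatibility table: side i of A is compatible only with side i + 1 of B.
compatible = [[j == (i + 1) % 4 for j in range(4)] for i in range(4)]
offsets = ordered_offsets(compatible)

# Hypothetical candidates for one rectangle; "ground" lies below the legal range.
candidates = {"ground": [0.2, 0.2, 0.3, 0.2],
              "B1": [5.0, 5.0, 9.0, 5.0],
              "B2": [5.0, 5.5, 5.0, 5.0]}
best = pick_match(candidates, d_min=1.0, d_max=20.0)
```

Choosing the candidate with the smallest spread corresponds to preferring the least occluded interpretation.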

Stereo  can  serve  as  an  important  visual  clue  in  select¬ 
ing  those  collated  features  which  have  a  very  good  chance 
of  corresponding  to  actual  object  structures,  in  this  case 
the  roofs.  Selection  of  the  proper  collated  features  is  cru¬ 
cial  for  this  domain  as  many  other  objects  in  the  scene 
such  as  road  segments,  parking  lots  and  sidewalks  have 
rectangular  structures.  Furthermore,  these  objects  are  ar¬ 
ranged  in  a  regular  grid  like  manner,  and  some  collated 
features  formed  reflect  the  structure  in  the  layout  of  the 
scene rather than that of specific objects. In general,
objects in a scene are not organized in a regular fashion,
and other sources of visual information such as stereo may
not be required for aiding the selection process. The rectangle
collations which are components of the roof shapes
have heights above the ground, and the disparities of their
sides lie within ranges reasonable for the stereo pair. The
rectangle collated features that meet these criteria are
selected from the rest. These have a high probability of
belonging to roofs or parts of roofs. Rectangle groupings
so selected are shown in Fig. 9.

Figure 10: Rectangles matched by stereo. (a) Right image; (b) Left image.

There  is  a  loss  of  accuracy  in  the  determination  of  the 
disparities  as  a  result  of  the  robustness  in  the  detection  of 
the  matched  primitives.  The  rectangles  are  collated  fea¬ 
tures,  and  are  thus  primarily  structural  representations 
with  low  positional  accuracy.  The  component  lines  of  the 
rectangle  only  represent  the  structure  among  the  under¬ 
lying  edges,  not  their  positions.  For  obtaining  accurate 
disparity,  matching  of  more  precisely  located  features, 
namely  the  edges,  is  required.  The  lines  of  the  rectan¬ 
gles  are  replaced  by  the  linear  segments  they  represented 
and  these  are  then  matched.  There  is  little  ambiguity  in 
matching  linear  segments  at  this  stage,  since  the  linear 
segments  of  a  line  are  matched  only  to  the  linear  seg¬ 
ments  of  the  corresponding  matching  line  found  during 
matching  the  rectangles.  While  each  line  might  represent 
multiple close parallel linear segments, we choose only the
innermost edges for matching. The rectangles represent areas
bounded by the lines and so the inner edges are a sensible
representation of the rectangle. This is a convenient
approximation  and  may  not  be  correct  for  all  cases.  We 
are  currently  working  on  using  sensitive  edge  detectors  on 
magnified  portions  of  the  image  in  small  windows  around 
the  lines  for  precise  detection  and  location  of  edges.  We 
can  consider  even  weak  edges  near  the  noise  level  of  the 
image,  since  we  have  an  idea  of  the  direction  of  the  edges 
and  their  geometry  (straight  lines)  and  an  approximate 
idea  of  their  location. 

7  Applications to Shape Description and Object Extraction

We  combine  the  collated  features  into  structures  corre¬ 
sponding  to  the  objects.  The  compositions  of  the  col¬ 
lations  provide  a  basis  for  the  description  of  the  final 
shape.  The  compositions  themselves  identify  individual 
object parts (object extraction). Since our choice for the
collated feature corresponding to the rectangle identifies
the same structure as the shape primitive (rectangle) we
employ  to  describe  the  object  shapes,  no  translation  from 
the  shapes  of  the  collated  features  to  the  shapes  of  the 
description  primitives  is  required. 

The  task  of  combining  the  collated  features  into  shapes 
has  to  be  mediated  by  the  process  of  visual  reasoning. 
Since  the  shapes  arising  from  the  combinations  corre¬ 
spond  to  individual  objects,  some  of  them  may  not  ex¬ 
hibit  the  structural  regularity  of  the  collated  features  and 
thus  may  not  directly  correspond  to  a  collated  feature  but 
may  have  to  be  derived  by  combining  collated  features. 
The  visual  reasoning  carried  out  currently  is  primarily 
monocular,  augmented  by  stereo  as  needed. 

In  contrast  to  previous  uses  of  monocular  analysis, 
we  work  with  more  organized  structures  than  lines  and 
junctions. Also, T junctions, which are a key element in
monocular analysis, cannot be utilized for this application
domain because of the presence of false T junctions
due to alignment. As with other phases of processing, the
reader will note that the organized nature of the primitives
used for processing brings more information to the
monocular analysis than is available with just edge and
junction information.

The  structural  relationships  being  considered  are  those 
of  subsumption  or  inclusion,  merger-compatibility,  occlu¬ 
sion  and  incompatibility.  Let  the  shapes  be  defined  by 
their  boundaries.  Consider  two  structures  A  and  B.  The 
bounding  contours  of  the  structures  correspond  to  group¬ 
ings  of  the  underlying  intensity  edges.  Sections  of  the 
boundary  thus  correspond  to  grouped  edge-contours,  sin¬ 
gle  edge-contours  or  are  abstractions  generated  by  the 
grouping  process. 

We  find  the  intersections  between  the  boundaries  of  A 
and B. These intersections divide the boundary of each
structure into contour-segments. The contour-segments
of  each  structure  are  then  assigned  to  one  of  three  dis¬ 
joint  sets,  one  containing  segments  that  lie  outside  the 
other  structure,  one  containing  segments  that  lie  inside 
the  other  structure  and  one  containing  segments  shared  by 
the  other  structure.  Since  the  positioning  of  the  bound¬ 
aries  is  approximate,  allowances  have  to  be  made  during 
the  computation  of  these  sets  to  account  for  these  inaccu¬ 
racies  (for  example  close  parallel  and  overlapping  bound¬ 
aries  may  be  termed  “shared”). 

$O_{AB}$ : Set of contour segments of A outside B
$I_{AB}$ : Set of contour segments of A inside B
$S_{AB}$ : Set of contour segments shared by A and B, $S_{AB} = S_{BA}$
$\mathrm{Edg}(X)$ : Set of edges of the set of contour segments X

Subsumption:  If  the  outside-segment  set  of  shape  A  is 
empty  and  the  shared  segment  set  non-empty  and  the 
edge-support  for  segments  in  the  inside-segment  set  is 
poor  or  non-existent,  then  we  say  that  structure  A  is  sub¬ 
sumed  by  structure  B  and  can  be  removed. 

$$(O_{AB} = \emptyset) \wedge (S_{AB} \neq \emptyset) \wedge (\mathrm{Edg}(I_{AB}) < \tau) \Rightarrow B \text{ subsumes } A$$

The  following  relationships  are  only  checked  when  sub¬ 
sumption  is  not  present. 

Occlusion: If the contour-segments of A inside B have
strong edge support and those of B inside A have weak
intensity edge support then we can assume that shape A
occludes shape B. Note that this applies even if the rest
of the contour-segments of A and B belong to the shared
set or outside set.

$$(\mathrm{Edg}(I_{AB}) > \tau) \wedge (\mathrm{Edg}(I_{BA}) < \tau) \Rightarrow A \text{ occludes } B$$


Figure 11: Only possible combination of rectangles in Fig. 10.b

Merger-Compatibility: If the segments in the inside-segment
and shared-segment sets for both shapes A and B
have poor edge support then we can conclude that A and
B represent a segmentation of one structure into two
parts and can thus be merged into one structure. The
merger  operation  is  that  of  union.  Note  that  an  implicit 
assumption has been made that the outside-segment set
of A is non-empty since each shape has to have some
edge support for it to be formed. The one combination
of rectangles resulting from this process on Fig. 10.b is
shown in Fig. 11.

$$(\mathrm{Edg}(I_{AB}) < \tau) \wedge (\mathrm{Edg}(I_{BA}) < \tau) \wedge (\mathrm{Edg}(S_{AB}) < \tau) \Rightarrow A \cup B$$

If  on  the  other  hand,  the  outside-segment  set  of  A  is 
empty  and  the  shared-segment  set  of  A  has  poor  edge 
support  then  A  and  B  are  merged  using  the  difference 
operation  B  -  A. 

$$(\mathrm{Edg}(I_{AB}) > \tau) \wedge (I_{BA} = \emptyset) \wedge (O_{AB} = \emptyset) \wedge (\mathrm{Edg}(S_{AB}) < \tau) \Rightarrow B - A$$

If  stereo  information  is  present,  it  should  be  checked 
that  the  heights  of  A  and  B  are  compatible  since  edge 
support  could  be  lacking  in  shared-segments  of  adjoining 
objects  of  similar  surface  properties  due  to  the  absence  of 
contrast. 

Unrelated:  If  A  and  B  have  null  inside-segment  sets 
and  null  shared-segment  sets  then  they  are  unrelated.  If 
A and B have a non-empty shared-segment set (and null
inside-segment sets) but the shared segments have good
edge support then A and B are still unrelated (though
adjoining).

$$(I_{AB} = \emptyset) \wedge (I_{BA} = \emptyset) \wedge ((S_{AB} = \emptyset) \vee (\mathrm{Edg}(S_{AB}) > \tau)) \Rightarrow A \text{ and } B \text{ are unrelated}$$

Incompatible: If A and B have non-empty inside-segment
sets and the elements of the inside-segment sets of both A and
B have strong edge support then at least one of A or B is
a wrong structural grouping and must be deleted.

$$(I_{AB} \neq \emptyset) \wedge (I_{BA} \neq \emptyset) \wedge (\mathrm{Edg}(I_{AB}) > \tau) \wedge (\mathrm{Edg}(I_{BA}) > \tau) \Rightarrow A \text{ and } B \text{ are incompatible}$$
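
The pairwise tests defined above can be sketched as a single Python function. Several choices here are assumptions of the sketch and not part of the definitions: each contour-segment set is represented as a list of per-segment edge-support scores in [0, 1], Edg(X) is taken as the mean score (0 for an empty set), the threshold defaults to 0.5, and a guard requires some inside or shared segment before a union merger is proposed. Because no priority among the non-subsumption tests is stated, the function returns every relation whose test holds.

```python
def edg(segments):
    """Edge-support score of a set of contour segments (assumed: mean score, 0.0 if empty)."""
    return sum(segments) / len(segments) if segments else 0.0

def relations(O_ab, I_ab, I_ba, S_ab, tau=0.5):
    """Return the labels of all relations of A to B whose test holds.

    O_ab, I_ab -- segments of A outside / inside B
    I_ba       -- segments of B inside A
    S_ab       -- segments shared by A and B
    """
    if not O_ab and S_ab and edg(I_ab) < tau:
        return ["B subsumes A"]               # remaining tests are skipped
    found = []
    if edg(I_ab) > tau and edg(I_ba) < tau:
        found.append("A occludes B")
    overlap = bool(I_ab or I_ba or S_ab)      # guard added in this sketch only
    if overlap and edg(I_ab) < tau and edg(I_ba) < tau and edg(S_ab) < tau:
        found.append("merge: A union B")
    if edg(I_ab) > tau and not I_ba and not O_ab and edg(S_ab) < tau:
        found.append("merge: B minus A")
    if not I_ab and not I_ba and (not S_ab or edg(S_ab) > tau):
        found.append("unrelated")
    if I_ab and I_ba and edg(I_ab) > tau and edg(I_ba) > tau:
        found.append("incompatible")
    return found

# Hypothetical example: A and B share a weakly supported boundary section.
example = relations(O_ab=[0.9], I_ab=[], I_ba=[], S_ab=[0.1])
```

The function tests only the relation of A to B; in practice it would be called a second time with the roles of A and B exchanged.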

The decision of which structure is erroneous is difficult
to make in the context of just two structures; one could
possibly retain the structure with more edge and corner
support inside the other. If A and/or B conflict with
other structures then the one with the most conflicts can
be deleted first, and so on. In the case of a tie where
we are left with a pair of mutually conflicting structures,
other information such as stereo could be used to resolve
conflicts.

Our current system reports on the conflicts but does
not resolve them. In our test cases we had only one case
of conflicting structures, and there the decision was easy
to make (manually) as one of the structures was not only
conflicting with a number of other structures but was also
being “occluded” by a structure of lower height (as reported
by stereo) than itself.

Figure 12: Final combination of rectangles (corresponding
to roofs). (a) Right image; (b) Left image.

Starting from the rectangles selected from the previous
stages, we perform the above analysis on all pairs of
rectangles, first removing subsumed structures and then
forming new structures on any possible mergers of rectangles.
The process is recursively applied to the new structures
along with the original structures from the previous
step until no new structures are formed. During this combination,
duplication of the structures is possible, but it
is trivial to detect duplicate structures since they have
exactly the same component rectangles.

The geometrical relationships among the shape primitives
(rectangles) and their combinations form a graph
which is a structural description of the objects in the
scene in terms of the primitive. Structures in the graph
which are not marked as subsumed, merged or incompatible
are selected as the top level descriptions of the objects
or object parts visible in the scene (roofs for our image
domain).

The final structures are assigned heights from the disparity
information previously obtained by stereo. The
buildings are modeled by drawing walls straight down
from the sides of the roofs to the plane below,
be it of another roof or the ground. The resulting model
is displayed in Fig. 13.

Results on two more image pairs are shown in the
following figures.

Figure 13: Rendered view of 3D model of building detected

Figure 22: Rectangles selected by CSN

Figure 23: Rectangles matched by stereo (none of the rectangles merge; the final shapes
formed, corresponding to the roofs, are the same)

Figure 24: Rendered view of building for Aerial Image III

8  Conclusions 

We  have  proposed  collated  features  as  the  representations 
computed  by  the  process  of  perceptual  organization  ap¬ 
plied  to  the  primitive  image  elements.  These  collations 
represent  structural  relationships  between  the  arrange¬ 
ment  of  their  tokens.  We  have  identified  the  structural 
relationships  so  represented,  in  terms  of  their  significance 
for  the  shapes  in  our  visual  domain  and  the  utility  to 
other  visual  processes.  Further  we  have  shown  that  col¬ 
lated  features  are  useful  for  stereo  and  the  generation  of 
shape  descriptions  and  object  segmentation. 


We  have  illustrated  our  theory  with  a  working  vision 
system  which  works  on  real  images  from  a  restricted  shape 
domain.  The  fact  that  other  techniques  such  as  model 
matching  are  not  applicable,  and  that  attempts  at  using 
techniques  based  on  primitive  image  elements  (such  as 
contour  tracing)  have  been  unsuccessful  on  this  domain, 
supports the need for and usefulness of collated features.

ABSTRACT

A  novel  technique  called  dynamic  model  matching  (DMM)  is 
presented  for  target  recognition  from  a  moving  platform  such  as  an 
autonomous  combat  vehicle.  The  DMM  technique  overcomes  major 
limitations  in  present  model-based  target  recognition  techniques  that  use  a 
single,  static  target  model,  and  therefore  cannot  account  for  continuous 
changes  in  the  target’s  appearance  caused  by  varying  range  and  perspective. 
DMM addresses this problem by combining a moving camera model, 3-D
object  models,  spatial  models,  and  expected  range  and  perspective  to 
generate  multiple  2-D  image  models  for  matching.  DMM  also  generates 
recognition  strategies  that  can  emphasize  different  object  features  at 
varying  ranges.  DMM  operates  within  a  larger  system  for  landmark 
recognition  based  on  the  perception,  reasoning,  action,  and  expectation 
paradigm called PREACTE. Results are presented on a number of test sites
using  color  video  data  obtained  from  the  autonomous  land  vehicle. 

1.  INTRODUCTION 

Target  recognition  from  a  mobile  platform  such  as  an  autonomous 
combat  vehicle  in  outdoor  scenarios  presents  one  of  the  most  challenging 
problems  of  the  machine  vision  community.  It  requires  the  ability  to 
recognize  targets  from  varying  ranges  and  perspectives  under  changing 
environmental  conditions.  Earlier  approaches  emphasized  the  need  for 
rotation-invariant and range-independent target models.1,2 It was soon
evident,  however,  that  these  models  are  weak  because  they  have  few 
parameters  and  cannot  adequately  handle  different  aspects  at  different 
ranges.  Weak  segmentation  methods  further  aggravate  the  recognition 
problem. 

Landmark  recognition  is  a  typical  application  of  target  recognition  for 
autonomous  vehicles.  It  is  used  to  update  the  land  navigation  system  that 
accrues  a  significant  amount  of  error  after  the  vehicle  traverses  long 
distances,  which  is  typically  the  case  in  surveillance  search  and  rescue 
missions.  The  vision  system  of  the  autonomous  vehicle  is  required  to 
recognize  the  landmarks  as  the  vehicle  approaches  from  the  road  or  on 
terrain. 

We  have  developed  an  expectation-driven,  knowledge-based  landmark 
recognition  system  called  PREACTE  that  uses  a  priori,  map  and  landmark 
knowledge,  spatial  reasoning,  and  a  novel  dynamic  model  matching 
(DMM)  technique.  PREACTE's  mission  is  to  predict  and  recognize 
landmarks  and  targets  as  the  vehicle  approaches  them  from  different 
perspective  angles  at  varying  ranges.  Once  the  landmarks  have  been 
recognized,  they  are  associated  with  specific  map  coordinates,  which  are 
then  compared  to  the  land  navigation  system  readings,  and  subsequent 
corrections  are  made.  Landmarks  of  interest  include  buildings,  gates, 
poles,  and  other  man-made  objects. 

DMM  departs  from  previous  model-based  and  prediction-based  vision 
systems3,4 by addressing the following requirements:

•  Target models are dynamic.

•  Different targets require different representation and modeling
techniques.

•  Single targets require hybrid models.

•  At different ranges, different matching and recognition plans need
to be performed.

*This research was supported by DARPA under Contract No. DACA76-86-C-0017.

DMM  generates  and  matches  target  landmark  and  map  site 
descriptions dynamically based on different ranges and perspectives.
These descriptions are a collection of spatial, feature, geometric, and
semantic  models.  From  a  given  (or  approximated)  range  and  view  angle, 
and  using  a  priori  map  information,  3-D  landmark  models,  and  the  camera 
model,  PREACTE  generates  predictions  about  the  individual  landmark 
location  in  the  2-D  image.  The  parameters  of  all  models  are  a  function  of 
range  and  view  angle.  As  the  vehicle  approaches  the  expected  landmark, 
the  image  content  changes,  which  in  turn  requires  updating  the  search  and 
match  strategies.  Landmark  recognition  in  this  framework  is  divided  into 
three  stages:  detection,  recognition,  and  verification.  At  far  ranges,  only 
"detection"  of  distinguishing  landmark  features  is  possible,  whereas  at 
close  ranges,  recognition  and  verification  are  more  feasible,  since  more 
details  of  objects  are  observable. 

In  the  following  sections  we  present  a  brief  description  of  PREACTE, 
details  of  DMM,  and  show  results  on  real  imagery.  More  details  on 
PREACTE can be found in Nasr and Bhanu.5,6

2.  CONCEPTUAL  APPROACH 

The  task  of  visual  landmark  recognition  in  the  autonomous  combat 
vehicle  scenario  can  be  categorized  as  uninformed  or  informed.  In  the 
uninformed  case,  given  a  map  representation,  the  vision  system  attempts  to 
attach  specific  landmark  labels  to  image  objects  of  an  arbitrary  observed 
scene  and  infers  the  location  of  the  vehicle  on  the  map  (world).  In  this 
case,  image  to  map  registration  and  spatial  or  topological  information  about 
the  observed  objects  is  typically  used  to  infer  their  identity  and  the  location 
of the robot on the map as a result. In the informed case, while the task is
the same as before, there is a priori knowledge (with a certain level of
certainty) of the past location of the robot on the map and its velocity. It is
the  informed  case  that  is  of  interest  in  this  paper. 

Figure  1  illustrates  the  overall  approach  to  PREACTE's  landmark 
recognition  task.  It  is  a  top-down,  expectation-driven  approach,  whereby 
an  expected  site  model  (ESM)  on  the  map  is  generated  based  on  extensive 
domain-dependent  knowledge  of  the  current  (or  projected)  location  of  the 
vehicle  on  the  map  and  its  velocity.  The  ESM  contains  models  of  the 
expected  map  site  and  its  landmarks.  These  models  provide  the  hypotheses 
to  be  verified  by  a  sequence  of  images  acquired  at  predicted  times  t,  given 
the  velocity  of  the  vehicle  and  the  distance  between  the  current  site  and  the 
predicted  one.  Figure  2  illustrates  this  concept.  As  shown,  map  site 
models  introduce  spatial  constraints  on  the  locations  and  distributions  of 
landmarks,  using  a  "road"  model  as  a  reference.  Spatial  constraints  greatly 
reduce  the  search  space  while  attempting  to  find  a  correspondence  between 
the  image  regions  and  a  model.  This  mapping  is  usually  many-to-one  in 
complex  outdoor  scenes  because  of  imperfect  segmentation. 

The  ESM  is  dynamic  in  the  sense  that  the  expectations  and 
descriptions  of  different  landmarks  are  based  on  different  ranges  and  view 
angles.  Multiple  and  hybrid  landmark  models  are  used  to  generate 
landmark  descriptions  as  the  robot  approaches  a  landmark,  leading  to 
multiple  model/image  matching  steps.  This  is  what  is  referred  to  as 
dynamic  model  matching  (DMM).  The  landmark  descriptions  are  based  on 
spatial,  feature,  geometric,  and  semantic  models.  There  are  two  types  of 
expectations: range dependent and range independent. Range-dependent
expectations are landmark features such as size, length, width, volume, etc.
Range-independent  ones  include  color,  perimeter  squared  over  area,  length 
over  width,  shape,  etc. 
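
To illustrate the distinction, the sketch below scales a rectangular region, used here as a stand-in for a landmark seen at two ranges, and computes both kinds of feature. The rectangle, the helper name, and the numbers are illustrative assumptions, not PREACTE code.

```python
def region_features(length, width):
    """Return a few features of an axis-aligned rectangular image region."""
    area = length * width
    perimeter = 2 * (length + width)
    return {
        "length": length,                       # range dependent
        "width": width,                         # range dependent
        "area": area,                           # range dependent
        "p2_over_area": perimeter ** 2 / area,  # range independent
        "length_over_width": length / width,    # range independent
    }


far = region_features(length=40.0, width=4.0)   # landmark seen from far away
near = region_features(length=80.0, width=8.0)  # same landmark at half the range

for name in far:
    print(f"{name:18s} far={far[name]:8.2f} near={near[name]:8.2f}")
```

Length, width, and area change with the scale factor, while the two ratios stay the same, which is why they can be stored once per landmark.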

Different  landmarks  require  different  strategies  and  plans  for  detection 
and  recognition  at  different  ranges.  For  example,  a  yellow  gate  has  a 
distinctive  color  feature  that  can  be  used  to  cue  the  landmark  recognition 
process  and  reduce  the  search  space.  A  telephone  pole,  on  the  other  hand, 
requires  the  emphasis  of  the  length/width  feature. 

Given the vehicle position in the map and an acquired image, PREACTE
performs the following steps:

1.  Generate  2-D  descriptions  from  3-D  models  for  each  landmark 
expected  in  the  image. 

2.  Find  the  focus  of  attention  areas  (FOAAs)  in  the  2-D  image  for 
each  expected  landmark. 

3.  Generate  the  recognition  plan  to  search  for  each  landmark,  which 
includes  what  features  will  be  used  for  each  landmark  and  at  what 
range. 

4.  Generate  the  ESM  at  that  range  and  aspect  angle. 

5.  Search  for  regions  in  the  FOAA  of  the  segmented  image  that  best 
match  the  features  in  the  model. 

6.  Search  for  lines  in  the  FOAA  in  the  line  image  that  best  match  the 
lines  generated  from  the  3-D  geometric  model  (this  step  is 
performed  at  close  ranges  where  details  can  be  observed). 

7.  Match  expected  landmark  features  with  region  attributes,  and 
compute  matching  confidences  for  all  landmarks. 

8.  Correct  the  approximated  range  by  using  the  size  differences  of 
the  suspected  landmark  in  the  current  and  previous  frames. 

9.  Compute  the  uncertainty  about  the  map  site  location  based  on  the 
previous  and  current  matching  results. 
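
Step 8 can be sketched under a pinhole-camera assumption, in which the apparent size of a landmark is inversely proportional to its range. The paper does not give the correction formula, so the relation and the function name below are assumptions made for illustration.

```python
def corrected_range(prev_range_m, prev_size_px, curr_size_px):
    """Estimate the current range from the change in apparent size.

    Assumes a pinhole camera, so the size in pixels is proportional to 1 / range.
    """
    if prev_size_px <= 0 or curr_size_px <= 0:
        raise ValueError("apparent sizes must be positive")
    return prev_range_m * prev_size_px / curr_size_px


# A gate that was 50 px tall at an estimated 30 m now appears 75 px tall.
print(corrected_range(30.0, 50.0, 75.0))
```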

2.1.  MAP/LANDMARK  KNOWLEDGE  BASE 

Extensive  map  knowledge  and  landmark  models  are  fundamental  to 
this  recognition  task.  Our  map  representation  relies  heavily  on  declarative 
and  explicit  knowledge  instead  of  procedural  methods  on  relational 
databases.  The  map  is  represented  as  a  quadtree,  which  in  turn  is 
represented  in  a  hierarchical  relational  network.  All  map  primitives  are 
represented  in  a  schema  structure.  The  map  dimensions  are  characterized 
by  their  cartographic  coordinates.  This  schema  representation  provides  an 
object-oriented  computational  environment  that  supports  the  inheritance  of 
properties  by  different  map  primitives  and  allows  modular  and  flexible 
means  for  searching  the  map  knowledge  base.  The  map  sites  between 
which  the  vehicle  traverses  have  been  surveyed  and  characterized  by  site 
numbers.  A  large  database  of  information  is  available  about  these  sites. 
This  includes  approximate  latitude,  longitude,  elevation,  distance  between 
sites,  terrain  descriptions,  landmark  labels  contained  in  a  site,  etc.  Such  site 
information  is  represented  in  a  SITE  schema,  with  corresponding  slots. 
Slot names include HAS_LANDMARKS, NEXT_SITE, LOCATION, etc.
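
As a rough picture of such a schema, the sketch below models a site as a record with the slots named above and follows NEXT_SITE to predict the upcoming site and the time needed to reach it. The Python dataclass, site number 106, the coordinates, and the 0.5 km spacing are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Site:
    number: int
    location: Tuple[float, float]                           # LOCATION slot
    has_landmarks: List[str] = field(default_factory=list)  # HAS_LANDMARKS slot
    next_site: Optional[int] = None                         # NEXT_SITE slot
    distance_to_next_km: float = 0.0                        # distance between sites


def predict_next(sites: Dict[int, Site], current: int, speed_kmh: float):
    """Return the next site and the travel time to it in seconds."""
    site = sites[current]
    if site.next_site is None:
        return None, None
    seconds = site.distance_to_next_km / speed_kmh * 3600.0
    return sites[site.next_site], seconds


# Hypothetical two-site map fragment.
sites = {
    105: Site(105, (0.0, 0.0), ["gate", "telephone-pole"], next_site=106,
              distance_to_next_km=0.5),
    106: Site(106, (0.0, -0.5), ["yellow-gate"]),
}
upcoming, eta = predict_next(sites, 105, speed_kmh=10.0)
print(upcoming.number, upcoming.has_landmarks, eta)
```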

Each  map  site  that  contains  landmarks  of  interest  has  an  explicitly 
stored  spatial  model,  which  describes  in  3-D  the  location  of  the  landmarks 
relative  to  the  road  and  to  each  other.  By  using  a  detailed  camera  model, 
range,  and  azimuth  angle,  we  can  generate  2-D  views  of  the  landmarks  as 
shown  in  Figure  3.  These  views  contain  symbolic  and  numeric  descriptions 
of  the  landmarks  and  their  parts. 

Given  a  priori  knowledge  of  the  vehicle's  current  location  on  the  map 
space and its velocity, it is possible to predict the upcoming site that will be
traversed  through  the  explicit  representation  of  map  knowledge.  The  ESM 
contains  information  about  the  predicted  (x,y)  location  of  a  given  landmark 
and  its  associated  FOAA,  which  is  an  expanded  area  around  the  predicted 
location  of  the  object. 
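
The prediction of an image location and its FOAA can be sketched as a pinhole projection followed by a padded bounding box. The focal length, image size, camera-frame coordinates, and padding factor below are illustrative assumptions; PREACTE's detailed camera model is not given in the paper.

```python
def project(point_cam, focal_px, image_size):
    """Project a 3-D point in camera coordinates (x right, y down, z forward)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    width, height = image_size
    u = focal_px * x / z + width / 2.0
    v = focal_px * y / z + height / 2.0
    return u, v


def foaa(center, expected_size, pad=1.5, image_size=(512, 512)):
    """Return (left, top, right, bottom) of an expanded box, clipped to the image."""
    (u, v), (w, h) = center, expected_size
    half_w, half_h = pad * w / 2.0, pad * h / 2.0
    width, height = image_size
    return (max(0.0, u - half_w), max(0.0, v - half_h),
            min(float(width), u + half_w), min(float(height), v + half_h))


center = project((-4.0, 0.5, 40.0), focal_px=800.0, image_size=(512, 512))
print(center)
print(foaa(center, expected_size=(60.0, 30.0)))
```

The padding factor plays the role of the "expanded area": a larger value tolerates more error in the predicted position at the cost of a larger search region.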


2.2.  OBJECT  MODELING 

Landmark  predictions  are  based  on  stored  map  information,  object 
models,  and  the  camera  model.  Each  landmark  has  a  hybrid  model  that 
includes  spatial,  feature,  geometric,  and  semantic  information.  Figure  4 
illustrates  this  hybrid  model  representation  for  a  yellow  gate;  this  model 
also  includes: 

•  Map  location 

•  Expected  (x,y)  location  in  the  image 

•  Location  with  respect  to  the  road  (i.e.,  left  or  right)  and 
approximate  distance 

•  Location  in  3-D 

The  feature-based  model  includes  information  about  local  features, 
such  as  color,  texture,  intensity,  size,  length,  width,  shape,  elongation, 
perimeter  squared  over  area,  linearity,  etc.  The  values  of  most  of  the  range- 
dependent  features,  such  as  the  size,  length,  width,  etc.,  are  obtained  from 
the  generated  geometric  model  at  that  given  range  and  azimuth  angle. 
Range-independent  feature  values  are  obtained  from  visual  observations 
and  training  data.  Different  parts  of  the  yellow  gate  are  represented  in  a 
semantic  network. 

3.  DYNAMIC  MODEL  MATCHING 

Each  landmark  has  a  number  of  dynamic  models,  as  shown  in 
Figure  1.  The  predicted  landmark  appearance  is  a  function  of  the 
estimated  range  and  view  angle  to  the  object.  The  range  and  view  angle  are 
initially  estimated  from  prior  locations  of  the  vehicle,  map  information,  and 
velocity;  they  can  be  corrected  based  on  recognition  results.  The  landmark 
recognition  task  is  performed  dynamically  at  a  sampled  clock  rate. 
Different  geometric  models  are  used  for  different  landmarks;  for  example, 
telephone poles can be best represented as generalized cylinders, whereas
buildings are better represented as wire frames. The different
representations  require  the  extraction  of  different  image  features. 

There  are  three  basic  steps  to  the  landmark  recognition  process  after 
generating  the  prediction  of  the  next  expected  site  and  its  associated 
landmarks.  These  are  1)  landmark  detection,  2)  landmark  recognition,  and 
3)  map  site  verification  and  landmark  position  update  on  the  map.  At  each 
stage, different sets of features are used.

Detection is a focus-of-attention stage; it occurs at ranges, say, greater
than  35m.  Very  few  details  of  landmarks  (such  as  structure)  can  be 
observed;  only  dominant  characteristics  can  be  observed,  such  as  color, 
size,  elongation,  straight  lines,  etc.  From  the  map  knowledge  base,  spatial 
information can be extracted, such as position of the landmarks with
respect  to  the  road  (left  or  right)  and  position  (in  a  2-D  image)  with  respect 
to  each  other  (above,  below,  or  between).  So,  using  spatial  knowledge 
abstracted  in  terms  of  spatial  models  and  some  dominant  feature  models, 
landmarks  can  be  detected,  but  not  recognized  with  a  relatively  high  degree 
of  confidence.  However,  this  varies  from  one  landmark  to  another;  because 
some  landmarks  are  larger  than  others,  it  may  be  possible  to  recognize 
them  at  such  distances. 

The  second  step,  landmark  recognition,  occurs  at  closer  ranges,  say,  35 
to  10m.  At  these  ranges,  most  objects  show  more  details  and  structure. 
Segmentation  is  more  reliable,  which  makes  it  possible  to  extract  lines  and 
vertices.  This  in  turn  makes  it  possible  to  use  detailed  geometric  models 
based  on  representations,  such  as  generalized  cylinders,  wire  frames,  and 
winged  edge,  depending  on  the  landmarks.  Nevertheless,  feature-  and 
spatial-based  information  is  still  used  prior  to  matching  the  geometric 
model  to  image  content,  because  it  greatly  reduces  the  search  space.  We 
should  note  here  that  the  feature  and  spatial  models  used  in  the  first  step  are 
updated,  because  obviously  the  landmarks  are  perceived  differently  in  the 
2-D  image  at  short  ranges. 

The third step is a verification stage that occurs at very close ranges.
At  this  stage,  PREACTE  confirms  or  denies  the  existence  of  the  landmarks 
and  the  map  site  location  to  the  vehicle.  Since  subparts  can  be  identified  at 
close  ranges  for  some  landmarks,  semantic  models  can  be  used  to  produce 
a  higher  degree  of  confidence  in  the  recognition  process.  Some  landmarks 
may  partly  disappear  from  the  field  of  view  (FOV)  at  this  range.  This 
information  about  the  potential  disappearance  of  objects  from  the  FOV  is 
obtained  from  the  3-D  model  of  the  landmark,  the  camera  model,  and  the 
range. 

Recognition  plans  are  explicitly  stated  in  the  landmark  model  for 
different  ranges,  as  shown  below: 

(defvar yellow-gate
  (make-instance object
    :name 'yellow-gate
    :parts (list y-g-west-wing y-g-east-wing)
    :geo-location '(392967.4 1050687.7)
    :plan '((40 15 detection) (15 8 recognition) (8 0 verification))
    :detection '(color)
    :recognition '(color length width area p2_over_area shape)
    :verification '(color length width area p2_over_area shape lines)))
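
A minimal Python rendering of how such a plan might be consulted is given below: the current range selects a stage, and the stage selects the features to match. The dictionary mirrors the yellow-gate model above, while the lookup function, the reading of the plan numbers as metres, and the boundary convention (upper bound inclusive) are assumptions.

```python
YELLOW_GATE = {
    "plan": [(40, 15, "detection"), (15, 8, "recognition"), (8, 0, "verification")],
    "detection": ["color"],
    "recognition": ["color", "length", "width", "area", "p2_over_area", "shape"],
    "verification": ["color", "length", "width", "area", "p2_over_area",
                     "shape", "lines"],
}


def stage_for_range(model, range_m):
    """Return (stage, features) for a range, or (None, []) if outside the plan."""
    for far, near, stage in model["plan"]:
        if near < range_m <= far:
            return stage, model[stage]
    return None, []


for r in (30, 12, 5, 60):
    print(r, stage_for_range(YELLOW_GATE, r))
```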

Once  the  FOAA  for  a  landmark  is  determined  from  the  predicted 
model,  all  regions  from  the  segmented  image  are  matched  against  the 
landmark.  More  details  on  this  matching  technique  can  be  found  in  Nasr 
and Bhanu.5,6
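
The matching technique itself is described in the cited references, not here. Purely as an illustration of region-to-model matching inside an FOAA, the sketch below scores each segmented region by the mean relative agreement of its features with the expected values. The scoring rule, the feature values, and all names are assumptions.

```python
def match_confidence(expected, region):
    """Mean per-feature agreement in [0, 1]; 1 means identical feature values."""
    scores = []
    for name, want in expected.items():
        got = region[name]
        scale = max(abs(want), abs(got), 1e-9)
        scores.append(1.0 - min(1.0, abs(want - got) / scale))
    return sum(scores) / len(scores)


def inside(box, point):
    """True if the point lies within the (left, top, right, bottom) box."""
    left, top, right, bottom = box
    x, y = point
    return left <= x <= right and top <= y <= bottom


def best_region(expected, regions, foaa_box):
    """Pick the best-scoring region whose centroid lies inside the FOAA."""
    candidates = [r for r in regions if inside(foaa_box, r["centroid"])]
    if not candidates:
        return None, 0.0
    best = max(candidates, key=lambda r: match_confidence(expected, r["features"]))
    return best["id"], match_confidence(expected, best["features"])


expected = {"length": 60.0, "width": 10.0, "p2_over_area": 32.7}
regions = [
    {"id": 1, "centroid": (180, 260),
     "features": {"length": 58.0, "width": 11.0, "p2_over_area": 31.0}},
    {"id": 2, "centroid": (185, 270),
     "features": {"length": 20.0, "width": 18.0, "p2_over_area": 16.1}},
    {"id": 3, "centroid": (400, 100),
     "features": {"length": 60.0, "width": 10.0, "p2_over_area": 32.7}},
]
print(best_region(expected, regions, foaa_box=(131.0, 243.5, 221.0, 288.5)))
```

Region 3 matches the expected features exactly but is ignored because it lies outside the FOAA, which shows how the spatial constraint prunes the search before any feature comparison.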

We compute the uncertainty U_s at each map site location in the
following manner:

where U_s is the uncertainty at site s, U_{s-1} is the uncertainty at the previous
site, L is the average accumulated error or uncertainty per kilometer of the
vehicle navigation system, a is the number of kilometers traveled between
the previous and the current site, and E(l_j)_s is the evidence accumulated
about landmark l_j at site s. U_s has a minimum value of zero, which
indicates the lowest uncertainty and is the value at the starting point. The
upper limit of U_s can be controlled by a threshold value and a
normalization factor.


4.  RESULTS 


PREACTE  and  DMM  were  tested  on  a  number  of  images  collected  by 
the  vehicle.  The  PREACTE  system  was  implemented  on  the  Symbolics 
3670. The image processing software was implemented in C on the VAX
11/750, and the image data were collected at 30 frames/sec. In this test, the
robot  started  at  map  site  105  and  headed  south  at  10  km/hr  (see  Figure  5). 
The  objective  of  the  test  was  to  predict  and  recognize  landmarks  that  were 
close  to  the  road  over  a  sequence  of  frames.  Figures  6  through  20  show 
different  stages  of  landmark  recognition  at  different  map  locations.  The 
figures  show  dynamic  models  generated  by  PREACTE  at  varying  ranges. 
They  also  show  how  PREACTE  changes  recognition  strategies  at  different 
stages  of  detection,  recognition,  and  verification.  In  addition,  the  figures 
show  the  computed  site  uncertainty  at  each  stage.  The  site  uncertainty 
fluctuates  depending  on  the  landmark  recognition  results  and  the  distance 
the  vehicle  has  traveled. 

In  the  future,  we  will  extend  this  approach  to  the  general  situation  in 
which  the  robot  may  be  traveling  through  terrain  and  must  determine 
precisely  where  it  is  on  the  map  by  using  landmark  recognition. 

Figure 1. Detailed conceptual approach of PREACTE and DMM.


Figure  2.  A  graphic  illustration  of  PREACTE's  landmark 
recognition  and  map/landmark  representation. 


Figure  3.  2-D  projections  from  different  view  angles  and  ranges. 


Figure 6. Site 105 is the next predicted map site. It
contains a gate and a telephone pole. PREACTE
projects a road model of the scene and an image
is processed and segmented.


Figure 7. A 2-D model of the expected site is generated
at the predicted range, and matching occurs.
The "PART TO MATCH" pane shows
descriptions of specific landmark parts as
matching occurs. The intensity feature is
emphasized in the lower left corner.


Figure 8. End of detection stage, with site uncertainty computed.


Figure  9.  Beginning  of  recognition  stage.  New  images  are 
processed  and  a  road  model  is  projected. 


Figure 10. A new 2-D model of the scene is generated.
More gate parts are identified. The rectangles
over the segmented image indicate the FOAAs.
The new model emphasizes a different set of
features: intensity, length to width ratio, and a
shape measure.


Figure 11. Site uncertainty has decreased because of additional
positive evidence about the landmark.


Figure 12. End of recognition stage and beginning of
verification stage.


Figure  13.  Site  uncertainty  is  computed  at  this  verification  stage. 
The  uncertainty  has  slightly  increased  because  of 
higher  matching  requirements. 


Figure 14. The vehicle arrives at a new site, which contains
a  yellow  gate. 


Figure 18. The gate at closer range, still at a recognition stage.
Matching  results  have  degraded  because  of  bad 
segmentation  results.  The  "Matching  Regions" 
pane  shows  the  candidate  regions  for  a  given  part 
of  the  gate. 


Figure 19. The vehicle arrives at another site that contains
another  gate. 


Figure 20. The gate model at very close range. The uncertainty
value is still at an acceptable range. DMM predicts
that some gate parts are outside the field of view and
avoids matching them.


ABSTRACT

Current  target  recognition  systems  are  unable  to  modify 
their  behavior  based  on  the  dynamic  environmental  changes 
which  occur  around  them.  In  order  to  perform  robustly  in 
unconstrained,  outdoor  environments,  a  target  recognition 
system  must  be  able  to  adapt  its  representation  of  this 
dynamic environment while maintaining acceptable performance
levels. Machine learning technology offers some
promising  solutions  to  many  of  these  problems  faced  in  the 
target  recognition  scenario.  Learning  allows  a  system  to  use 
situation  context,  to  adapt  its  representation  of  the  changing 
environment, and to improve the system’s recognition performance
over time. This paper describes an innovative system
which  combines  learning  and  target  recognition  into  an 
integrated  system  which  accomplishes  the  above  tasks.  This 
system  is  called  TRIPLE:  Target  Recognition  Incorporating 
Positive  Learning  Expertise.  It  uses  two  machine  learning 
techniques known as explanation-based learning and conceptual
clustering, combined with a knowledge-based reasoning
system, to provide robust target recognition. A complete
description of the TRIPLE system, as well as a simple example
showing the system’s behavior, is presented.


1.  INTRODUCTION 

Target  recognition  systems  currently  lack  the  ability  to 
adapt  to  changing  environmental  conditions  and  to  modify 
their  behavior  based  on  the  context  of  the  situation  in  which 
they are operating. In order to be effective in dynamic outdoor
scenarios, a robust recognition system should be able to
automatically  acquire  necessary  contextual  information  from 
the  environment.  Most  target  recognition  systems  lack  this 
capability.  Their  performance  begins  to  quickly  degrade 
when subjected to problems such as variable lighting conditions,
image noise, and object occlusion.

Due  to  recent  advances  in  machine  learning  technology, 
some  of  the  problems  encountered  in  the  target  recognition 
domain  seem  to  be  resolvable.  Learning  allows  an  intelligent 
recognition system to use situation context in order to understand
images. This context, as defined in a machine learning
scenario,  consists  of  a  collected  body  of  background 
knowledge  as  well  as  environmental  observations  which  may 
impact  the  processing  of  the  scene.  The  vision-learning  cycle 
of  an  advanced  target  recognition  system  would  involve  the 
following  steps: 

Sense - acquire an image and apply initial image processing
algorithms.


Understand  -  using  present  background  knowledge  and 
scene  observations,  determine  those  objects  of  interest  in 
the  image. 

Act - based on the objects found, act according to global
system goals.

Update - modify system knowledge using image observations
so performance will improve next time.

This  cycle  repeats  each  time  the  recognition  system  is 
required  to  produce  results.  Because  the  system  dynamically 
reacts to the appropriate stimuli in the environment, it continuously
adapts its internal knowledge to maintain acceptable
performance  levels. 
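
The cycle can be pictured as a simple loop. The sketch below is schematic only: the four functions are hypothetical placeholders standing in for image processing, recognition, action, and learning, and the knowledge base is reduced to a set of known labels.

```python
def sense(frame):
    """Stand-in for image acquisition and initial processing."""
    return {"objects": frame}


def understand(observation, knowledge):
    """Label each observed object as known or unknown."""
    return [(obj, obj in knowledge["known"]) for obj in observation["objects"]]


def act(findings):
    """Stand-in for acting on global goals: report the recognized objects."""
    return [obj for obj, known in findings if known]


def update(findings, knowledge):
    """Store unknown objects so they can be recalled next time."""
    for obj, known in findings:
        if not known:
            knowledge["known"].add(obj)


knowledge = {"known": {"truck"}}
reports = []
for frame in (["truck", "jeep"], ["jeep", "truck"]):
    findings = understand(sense(frame), knowledge)
    reports.append(act(findings))
    update(findings, knowledge)
print(reports)
```

On the first pass only the truck is reported; the update step then stores the jeep, so both objects are reported on the second pass.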

In  addition  to  the  vision-learning  cycle  described  above, 
other desirable features to be incorporated in an advanced target
recognition system are: (a) the models used by the system
to  represent  targets,  contexts,  and  other  system  knowledge 
should  be  dynamic  data  structures.  This  specification  allows 
the  learning  component  to  quickly  modify  the  behavior  of  the 
system  by  changing  the  data  on  which  it  operates;  (b)  most 
data should be of a symbolic, qualitative nature, thus avoiding
the problems encountered in dealing with quantitative
information. Using qualitative information, we do not have
to rely on obtaining precise geometric representations of targets;
(c) the system has to be able to handle problems such as
imprecise segmentation, occlusion, noise, etc. Since
advanced  target  recognition  systems  are  required  to  operate  in 
real  world  situations,  they  must  be  capable  of  handling  these 
image  problems  that  will  inevitably  occur;  (d)  the  system 
should  exhibit  improved  performance  over  time.  This 
improvement  may  come  in  the  form  of  faster  recognition 
times,  improved  recognition  accuracy,  and  higher  confidence 
in  system  results. 

Machine  learning  will  facilitate  two  main  breakthroughs 
in  the  target  recognition  domain:  automatic  knowledge  base 
acquisition  and  continuous  knowledge  base  refinement.  The 
use  of  learning  in  the  knowledge  base  construction  will  save 
the user from spending the enormous amount of time necessary
to derive target models and databases. Knowledge base
refinement can then be incorporated to make any necessary
changes to improve the performance of the recognition system.
These two modifications alone will serve to
significantly  advance  the  present  abilities  of  most  target 
recognition  applications. 

To  validate  the  concept  of  a  target  recognition  system 
with  integrated  machine  learning  capabilities,  this  paper 
presents  an  overview  of  a  new  approach  to  target  recognition. 
The  system  currently  under  implementation  is  called 
TRIPLE: Target Recognition Incorporating Positive Learning
Expertise.  The  system  uses  a  multi-strategy  technique;  two 
powerful  learning  methodologies  are  combined  with  a 
knowledge-based  matching  technique  to  provide  robust  target 
recognition.  Using  dynamic  models,  TRIPLE  can  recognize 
targets  present  in  the  database.  If  necessary,  the  models  can 
be refined if errors are found in the representation. Additionally,
TRIPLE can automatically store a new target model and
recall it when that target is encountered again. Finally, since
TRIPLE  uses  qualitative  data  structures  to  represent  targets,  it 
can  overcome  problems  such  as  image  noise  and  occlusion. 

The two main learning components of the TRIPLE system
are Explanation-Based Learning (EBL) and Structured
Conceptual  Clustering  (SCC).  Explanation-based  learning 
provides  the  ability  to  build  a  generalized  description  of  a 
target  class  using  only  one  example  of  that  class.  Structured 
conceptual clustering allows the recognition system to classify
a target based on simple, conceptual descriptions rather
than using a predetermined, numerical measure of similarity.
While neither method, used separately, would provide substantial
benefits to a target recognition system, they can be
combined  to  offer  a  consolidated  technique  which  employs 
the  best  features  of  each  method. 

The  remainder  of  this  paper  will  provide  the  reader  with 
most  of  the  details  of  the  TRIPLE  system.  To  fully  explain 
some  of  the  machine  learning  techniques  which  are  being 
utilized,  section  2  presents  a  brief  overview  of  machine 
learning, including the explanation-based learning and conceptual
clustering methods. Following the coverage of these
techniques, section 3 outlines the approach to target recognition
used in TRIPLE. Section 4 provides an example of
TRIPLE’s  abilities  using  automobiles  as  the  targets  to  be 
recognized.  Finally,  section  5  presents  the  conclusions  of  the 
research  to  date  and  discusses  future  plans  for  the  current 
system. 


2.  MACHINE  LEARNING  TECHNIQUES 

The  ability  to  reason  and  the  ability  to  learn  are  the  two 
major  capabilities  associated  with  an  intelligent  being. 
Today,  machines  have  been  given  the  ability  to  reason  by 
developing  algorithms  which  duplicate,  to  a  limited  extent,  a 
reasoning  process.  Unfortunately,  machines  do  not  have  the 
ability to modify or update this reasoning process if the situation
so dictates. Nor do they have the ability to learn new
concepts which are encountered in the processing of information.
The lack of both of these abilities has severely limited
the  effectiveness  of  computers  to  work  in  unconstrained 
environments. 

Fortunately,  this  situation  is  slowly  beginning  to  change. 
Progress  is  being  made  in  the  development  of  algorithms 
which  can  modify  their  behavior  as  the  environment  in  which 
they  operate  changes.  Machine  learning  now  exists  as  a 
valid  portion  of  the  AI  and  psychology  fields,  has  dedicated 
conferences  and  journals,  and  is  being  added  as  a  feature  in  a 
broad  variety  of  computer  science  applications.  While  many 
of  these  additions  are  very  superficial  and  do  little  to  improve 
the  performance  of  the  system  to  which  they  are  added, 
research  has  led  to  fundamental  advances  in  several  areas  of 
machine  learning  technology. 

Advances  in  the  machine  learning  field  originated  in  the 
late  50’s.  Rosenblatt12  developed  the  perceptron,  a  self- 
organizing  system,  although  it  never  achieved  the  success 
anticipated  by  many  of  its  advocates.  Following  this  effort, 
researchers  concentrated  on  tasks  such  as  learning  from 
examples and language acquisition which relied on significant
amounts of data and time to search the problem space. Also
known as inductive learning systems, these learning-from-example
techniques were applied in a wide range of application
domains. The most influential research performed in the
area  of  inductive  learning  was  accomplished  by  Winston.16 
His  structural  learning  system  formed  concept  descriptions 
from  a  set  of  carefully  selected  examples  of  the  concept  as 
well  as  "near  misses".  Near  misses  represent  concepts  that 
are  similar  to  the  one  being  learned,  differing  only  in  a  small 
number  of  very  significant  details.  Positive  examples  serve 
to  generalize  the  concept  while  near  misses  provide  the 
necessary  amount  of  specificity.  Many  inductive  learning 
systems followed the early work done by Winston. Dietterich
and Michalski5 present a good comparison of various systems
which incorporate learning from examples.

In  most  systems  which  utilize  inductive  learning,  a 
method known as generalization is used to extract the common
features which characterize a group of objects. Generalization
has been used in various AI contexts for many years,
although  the  results  have  been  difficult  to  compare  due  to 
substantial  differences  in  implementation  and  domain  of 
application.  Mitchell10  casts  the  generalization  problem  into 
a  search  framework  and  compares  various  approaches  to  the 
problem.  The  search  space  consists  of  the  possible  generali¬ 
zations  that  can  be  constructed  for  a  given  problem. 
Methods  of  generalization  can  then  be  characterized  by  a 
search  strategy  such  as  depth-first,  breadth-first,  or  version 
space  technique. 
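The search view of generalization can be sketched for the simplest case, conjunctive hypotheses over attribute vectors. The code below keeps only the most specific hypothesis consistent with the positive examples, which corresponds to one boundary of a version space; the general boundary is omitted, and the attribute values and function names are assumptions chosen for illustration.

```python
WILDCARD = "?"


def generalize(hypothesis, example):
    """Minimally generalize a conjunctive hypothesis so that it covers example."""
    if hypothesis is None:  # no positive example seen yet
        return tuple(example)
    return tuple(h if h == e else WILDCARD for h, e in zip(hypothesis, example))


def covers(hypothesis, example):
    """True when every non-wildcard position of the hypothesis matches."""
    return all(h == WILDCARD or h == e for h, e in zip(hypothesis, example))


def most_specific_hypothesis(positives, negatives):
    """Walk the generalization space from specific to general."""
    hypothesis = None
    for example in positives:
        hypothesis = generalize(hypothesis, example)
    if hypothesis is not None and any(covers(hypothesis, n) for n in negatives):
        raise ValueError("no conjunctive hypothesis separates the examples")
    return hypothesis


# Hypothetical examples: (propulsion, wing shape, engine count).
positives = [("jet", "swept", "two_engines"), ("jet", "swept", "four_engines")]
negatives = [("propeller", "straight", "two_engines")]
learned = most_specific_hypothesis(positives, negatives)
```

Each call to `generalize` is one step through the space of possible generalizations; a depth-first or breadth-first strategy would explore the same space in a different order.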

Connell  and  Brady2  developed  a  system  which  learns 
the  descriptions  of  two-dimensional  objects  including  aerial 
views  of  airplanes  or  silhouette  images  of  various  hand  tools. 
This  technique  produced  structured  production  rules  which 
were  used  to  recognize  subsequent  instances  of  similar 
objects.  Using  inductive  generalization  techniques  which 
allow  for  disjunctions,  Connell  and  Brady’s  method  was  one 
of  the  first  systems  which  could  learn  from  real  image  data. 

The  trend  in  machine  learning  has  been  to  incorporate 
techniques  which  can  derive  the  maximum  amount  of  infor¬ 
mation  from  single  examples,  using  analytic  methods  rather 
than  empirical  ones.  Current  research  is  now  directed  at 
developing programs which provide learning from observation
and discovery. Explanation-based learning (EBL) and
structured  conceptual  clustering  (SCC),  both  of  which  are 
used  in  the  TRIPLE  system,  are  examples  of  learning  metho¬ 
dologies  which  employ  a  high  level  of  inference.  EBL, 
classified  as  a  learning  by  observation  technique,  uses  infer¬ 
ence  to  construct  a  useful  concept  description  from  a  single 
example  of  that  concept.  SCC,  which  is  also  a  learning  from 
observation  method,  employs  an  even  higher  level  of  infer¬ 
ence  since  it  does  not  rely  at  all  on  any  user  input  to  classify 
a  group  of  targets  into  conceptually  simple  groups.  These 
techniques  will  now  be  discussed. 

2.1  Explanation-Based  Learning 

Most  of  the  early  systems  which  utilized  learning  from 
examples  were  able  to  achieve  impressive  results  compared 
to  methods  which  did  not  use  any  form  of  machine  learning 
at  all.  However,  it  was  discovered  that  the  user  may  find  it 
difficult  or  impossible  to  provide  the  learning  mechanism 
with  enough  examples  to  properly  generalize  the  concept 
description.  Additionally,  the  system  was  unable  to  justify 
the  generalization  which  was  produced  from  a  set  of  exam¬ 
ples;  the  user  could  not  obtain  a  description  of  how  the 
objects  had  been  generalized.  These  factors  led  to  the 
development  of  learning  systems  which,  using  applicable 
background knowledge, could generate a concept description
from a single, user-provided example. At the same time, the
system  also  created  an  explanation  as  to  why  the  example 
yielded  that  particular  generalization.  Initially  referred  to  as 
explanation-based  generalization  (EBG),  this  technique  is 
now  commonly  called  explanation-based  learning  (EBL). 

The  generalization  process  employed  by  EBL  can  be 
viewed  as  a  search  through  the  possible  concept  description 
space.  The  objective  is  to  locate  the  correct  definition  of  the 
concept  being  learned.  To  constrain  the  size  of  this  search 
space, EBL relies on knowledge of the problem domain.
Since the extra information present in a set of multiple examples
is not provided, EBL must use some other type of
knowledge  to  sufficiently  generalize  the  single  example.  This 
information  exists  in  the  form  of  relevant  background 
knowledge  which  is  given  to  the  system.  Using  the  back¬ 
ground  knowledge,  EBL  is  able  to  produce  a  valid  generali¬ 
zation  of  the  single  example.  Additionally,  it  creates  a 
justification  of  the  generalization  in  terms  of  the  background 
knowledge  used  to  produce  that  generalization.  This 
justification  is  called  the  explanation  of  the  concept  example. 

The  origin  of  the  explanation-based  approach  to 
machine  learning  can  be  traced  back  to  the  STRIPS  system 
developed by Fikes et al.,6 which learns generalized robot
path  planning  motions.  From  this  initial  work,  the  creation 
of  a  concept  description  from  a  single  example  was  then  for¬ 
malized  by  DeJong.3  In  this  paper,  he  introduces  the  term 
explanation-based  generalization.  As  DeJong  continued 
working  on  his  system,  others  began  work  on  their  own 
extensions  or  changes  to  the  initial  EBG  method.  The  most 
prominent  of  these  was  the  research  done  by  Mitchell,  Keller, 
and Kedar-Cabelli.11 They proposed a standardized approach
to  explanation-based  generalization.  This  technique  creates 
an  explanation  structure,  represented  as  a  proof  tree,  which 
serves  as  the  generalization  of  the  concept.  The  generaliza¬ 
tion  is  a  two  step  process.  The  first  step  forms  the  explana¬ 
tion  that  separates  the  relevant  and  irrelevant  feature  values 
present  in  the  training  example.  Second,  the  explanation  is 
analyzed  to  determine  the  constraints  (numeric  values, 
numeric  ranges,  or  enumerated  values)  on  the  feature  values 
which  will  allow  the  explanation  to  apply  in  general. 
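The first step of this process, forming the explanation, can be sketched in a propositional setting. The domain theory `RULES`, the feature names, and the function `explain` are hypothetical; the rules are assumed to be acyclic. Because the sketch has no variables or numeric features, the second step (deriving constraints on feature values) reduces to taking the conjunction of the leaves of the proof.

```python
# Hypothetical domain theory: each goal maps to a list of alternative rule bodies.
RULES = {
    "airplane": [["has_wings", "has_fuselage", "can_fly"]],
    "can_fly": [["has_engines", "light_enough"]],
    "light_enough": [["aluminum_skin"]],
}


def explain(goal, facts, rules):
    """Return the leaf facts of a proof of goal, or None if no proof exists."""
    if goal in facts:
        return {goal}
    for body in rules.get(goal, []):
        leaves = set()
        for subgoal in body:
            sub_leaves = explain(subgoal, facts, rules)
            if sub_leaves is None:
                break  # this rule body fails; try the next alternative
            leaves |= sub_leaves
        else:
            return leaves
    return None


# A single training example, including features the explanation never uses.
example = {
    "has_wings", "has_fuselage", "has_engines",
    "aluminum_skin", "painted_grey", "parked_on_apron",
}
relevant = explain("airplane", example, RULES)
irrelevant = example - relevant
```

The leaves of the proof are the relevant features; everything else in the training example is discarded by the generalization.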

In  response  to  perceived  inadequacies  in  the  work  by 
Mitchell  et  al.,  DeJong  and  Mooney4  proposed  further  revi¬ 
sions  to  the  EBG  system.  DeJong  and  Mooney  felt  that  the 
term  explanation-based  learning  was  more  complete  than 
explanation-based  generalization  since  the  approach  seemed 
to  be  applicable  to  both  concept  refinement  and  concept  gen¬ 
eralization.  This  version  of  EBL  serves  as  the  basis  for  the 
target model creation and refinement component of the TRIPLE
system described in section 3. While Mitchell et al.'s
version  of  EBG  produced  the  object  generalization  in  a  two 
step  process,  the  EBL  technique  simultaneously  forms  an 
explanation  of  the  training  example  and  builds  the  general¬ 
ized  concept  of  the  training  example.  In  addition,  EBL  is 
capable  of  specializing  a  previously-defined,  over-generalized 
object  concept.  This  refinement  ability  is  very  valuable  since 
it  provides  a  partial  solution  to  the  problem  of  generalizing 
non-independent  conjunctive  sub-goals.  In  other  words,  after 
several  passes  over  a  concept  description  which  may  contain 
conflicting  information,  EBL  is  capable  of  properly  represent¬ 
ing  this  concept,  while  EBG  would  have  failed.  Since  this 
effectively  provides  a  form  of  explanation-based  specializa¬ 
tion  as  well  as  explanation-based  generalization,  DeJong  and 
Mooney  have  aptly  named  the  method  explanation-based 
learning. 


2.2  Structured  Conceptual  Clustering 

Classification  of  similar  objects  has  traditionally  been 
accomplished  using  mathematical  techniques  such  as  numeri¬ 
cal  taxonomy  and  cluster  analysis.1,14,1*  Using  a  pre-defined 
set  of  object  features  or  attributes,  these  techniques  would 
compute  clusters  of  objects;  clusters  are  characterized  by  high 
intra-class  similarity  and  low  inter-class  similarity.  Cluster¬ 
ing  methods  are  generally  unable  to  identify  groups  of 
objects  which  represent  conceptually  simple  concepts  since 
they  rely  on  numerical  measures  of  similarity.  In  addition, 
the results usually must be interpreted by expert data analysts
before the classification can be understood.

Problems  with  numerical  clustering  techniques  and  the 
proposed  solutions  have  been  numerous.  However,  they  do 
not  address  some  of  the  fundamental  problems  inherent  with 
the  clustering  methods.  First,  numerical  clustering  techniques 
are  context  free.  They  make  no  use  of  any  contextual  or 
background  information  while  computing  object  similarities. 
Psychological  tests  have  shown  that  humans  make  use  of 
significant  amounts  of  context  when  classifying  objects. 
Second,  most  methods  are  unable  to  expand  the  feature  space 
in  order  to  discover  new  features  which  may  yield  ideal 
classifications.  Simple,  linear  combinations  of  object  attri¬ 
butes  can  often  be  used  to  locate  intrinsic  groups  of  data. 
Third,  the  techniques  do  not  have  the  ability  to  select  and 
evaluate  object  attributes  when  generating  clusters.  The  attri¬ 
butes  are  simply  used  to  compute  distances  between  neigh¬ 
boring  objects  and  clusters.  Finally,  the  classification  results 
still  have  to  be  interpreted  because  a  characterization  of  the 
clusters  is  not  produced. 

The  problems  mentioned  above  have  caused  researchers 
to  design  systems  which  try  to  model  the  classification  tech¬ 
niques  used  by  humans.  People  normally  group  objects  using 
a  conjunction  of  attributes  which  represent  conceptually  sim¬ 
ple  ideas.  At  the  same  time,  they  also  consider  the  context  in 
which  the  objects  act,  which  often  determines  important 
features  which  can  be  used  in  the  classification  task.  Using  a 
collection  of  background  knowledge  to  provide  context, 
Michalski7  developed  a  new  version  of  clustering  which 
identified  groups  of  objects  which  represent  the  same  type  of 
conceptually  simple  ideas  that  humans  tend  to  use.  This 
approach  to  classification  is  known  as  conceptual  clustering. 
Since  it  does  not  rely  on  a  teacher  to  preclassify  the  objects, 
conceptual  clustering  is  superior  to  earlier  systems  which  use 
learning  from  examples. 

An  implementation  of  the  technique  Michalski 
developed  is  presented  in  a  paper  by  Michalski  and  Stepp.'1 
This  method,  known  as  the  CLUSTER/2  program,  constructs 
a  classification  of  objects  only  if  a  given  class  can  be 
specified  by  a  conjunctive  concept  which  uses  selected  object 
attributes.  Quality  measures  such  as  the  fit  between  the  clus¬ 
tering  and  the  observed  events,  the  inter-cluster  distance,  total 
number  of  features  used  in  the  concept  description,  and  the 
number  of  features  which  individually  discriminate  among  all 
the  clusters  have  been  used  to  judge  the  quality  of  the 
selected  object  attributes. 
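Two of the ideas in this paragraph, a conjunctive description of a class and the count of features that individually discriminate among all clusters, can be sketched as follows. This is not the CLUSTER/2 algorithm; the attribute names and both functions are assumptions, and the clusters are taken as given rather than searched for.

```python
def describe(cluster):
    """Conjunctive description: for each attribute, the set of values seen in the cluster."""
    description = {}
    for obj in cluster:
        for attribute, value in obj.items():
            description.setdefault(attribute, set()).add(value)
    return description


def discriminating_attributes(descriptions):
    """Attributes whose value sets are pairwise disjoint across all clusters."""
    shared_attributes = set.intersection(*(set(d) for d in descriptions))
    result = []
    for attribute in sorted(shared_attributes):
        value_sets = [d[attribute] for d in descriptions]
        # Pairwise disjoint exactly when no value is counted twice in the union.
        if sum(len(v) for v in value_sets) == len(set.union(*value_sets)):
            result.append(attribute)
    return result


# Hypothetical clusters of vehicles.
cars = [
    {"wheels": 4, "cargo_bed": "no", "color": "red"},
    {"wheels": 4, "cargo_bed": "no", "color": "blue"},
]
trucks = [{"wheels": 6, "cargo_bed": "yes", "color": "red"}]
descriptions = [describe(cars), describe(trucks)]
discriminators = discriminating_attributes(descriptions)
```

Here `color` is rejected because both clusters contain a red vehicle, while `wheels` and `cargo_bed` each separate the clusters on their own.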

To  validate  the  performance  of  the  CLUSTER/2  algo¬ 
rithm  and  compare  its  performance  with  classical  clustering 
approaches, Michalski and Stepp8 compared its classification
ability against 18 numerical taxonomy methods. Only 4
of  the  18  numerical  methods  were  able  to  produce  the  con¬ 
ceptually  appealing  results  of  the  CLUSTER/2  program. 
These  results  show  that  conceptual  clustering  achieves  many 
of  the  classification  goals  used  by  humans. 

As a further extension to their work, Stepp and Michalski13
constructed a new conceptual clustering system which
incorporates three main changes from the previous technique:
objects  are  complex  and  require  structural  descriptions; 
relevant  attributes  may  not  be  initially  provided  and  should 
be  dynamically  determined  in  that  case;  and  rules  of  infer¬ 
ence  are  used  to  derive  useful  high-level  concepts  from  the 
initial  low-level  information.  This  new  version  of  conceptual 
clustering  is  called  CLUSTER/S.  To  produce  valid 
classifications,  the  system  is  provided  with  a  general  goal  of 
classification.  Using  the  supplied  goal,  the  system  then  refer¬ 
ences  the  collection  of  background  knowledge  to  determine 
the  relevant  attributes  and  features  useful  for  clustering. 

The  background  knowledge  is  organized  into  a  network 
structure  called  a  Goal  Dependency  Network  (GDN).  The 
information  present  in  the  GDN  can  represent  general- 
purpose  knowledge  as  well  as  domain-specific  knowledge, 
both  of  which  are  necessary  in  the  problem  solving  process. 
General-purpose  knowledge  is  made  up  of  fundamental  con¬ 
straints  and  criteria  which  specify  the  general  properties  of 
classification.  The  domain-specific  information  contains 
inference  rules  for  deriving  new  descriptors  and  rules  for 
determining  which  descriptors  will  be  relevant.  Given  a  high 
level  goal  of  classification,  the  GDN  specifies  the  related 
sub-goals  and  any  associated  object  attributes  which  are 
relevant  at  that  level  in  the  network.  If  the  relevant  descrip¬ 
tors  are  not  present,  the  background  knowledge  is  used  to 
derive  new  descriptors  which  are  applicable.  The  Goal 
Dependency  Network  plays  a  vital  role  in  the  construction  of 
meaningful  classifications. 
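One way to picture the role of the GDN is as a goal tree whose nodes carry relevant attributes, paired with inference rules that derive descriptors missing from an object. The sketch below is an assumed representation, not the structure used by CLUSTER/S: `GoalNode`, `DERIVATIONS`, and the vehicle attributes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GoalNode:
    name: str
    attributes: list = field(default_factory=list)
    subgoals: list = field(default_factory=list)


def relevant_attributes(node):
    """Collect the attributes attached to a goal and to all of its sub-goals."""
    found = list(node.attributes)
    for subgoal in node.subgoals:
        for attribute in relevant_attributes(subgoal):
            if attribute not in found:
                found.append(attribute)
    return found


# Hypothetical inference rules: descriptor -> (input attributes, derivation).
DERIVATIONS = {
    "aspect_ratio": (("length", "width"), lambda length, width: length / width),
}


def complete(obj, attributes, derivations):
    """Derive relevant descriptors that are missing from obj when a rule applies."""
    completed = dict(obj)
    for attribute in attributes:
        if attribute in completed or attribute not in derivations:
            continue
        inputs, rule = derivations[attribute]
        if all(name in completed for name in inputs):
            completed[attribute] = rule(*(completed[name] for name in inputs))
    return completed


goal = GoalNode("classify_vehicle", subgoals=[
    GoalNode("estimate_size", attributes=["length", "aspect_ratio"]),
    GoalNode("identify_role", attributes=["cargo_bed"]),
])
wanted = relevant_attributes(goal)
vehicle = complete({"length": 6.0, "width": 2.0, "cargo_bed": True}, wanted, DERIVATIONS)
```

The goal determines which descriptors matter, and the derivation rules fill in `aspect_ratio` even though it was not part of the original object description.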

The  ability  of  CLUSTER/S  to  handle  complex,  struc¬ 
tural  objects  relies  on  the  information  present  in  the  GDN. 
Using  the  goal  of  classification,  the  system  is  able  to  deter¬ 
mine  which  structural  elements  of  the  object  are  most  useful 
in  selecting  the  appropriate  classification  scheme.  The 
advantages  of  this  approach  include: 

(1)  Ability  to  handle  compound  objects  which  require 
structural  descriptions  in  order  to  be  easily  classified. 

(2)  Use  of  goal-directed  inference  from  information  pro¬ 
vided  by  the  GDN. 

(3)  Formulation  of  new  object  attributes  using  the  GDN. 

(4) Ability to operate in either a model-driven or a data-driven manner.

The  ability  to  select  appropriate  classification  features  and  to 
generate  new  attributes  when  necessary  places  the  conceptual 
clustering  technique  into  the  area  of  machine  learning 
referred  to  earlier  as  learning  from  observation. 

In  the  domain  of  target  recognition,  most  complex  tar¬ 
gets  are  represented  as  a  structured  collection  of  sub-parts. 
The  ability  to  cluster  such  descriptions  is  very  useful  in  this 
application.  In  the  remainder  of  this  paper,  Stepp  and 
Michalski’s  CLUSTER/S  system  will  be  referred  to  as  Struc¬ 
tured  Conceptual  Clustering  (SCC)  because  of  the  ability  to 
handle  these  structural  descriptions. 


3.  TRIPLE  TARGET  RECOGNITION  SYSTEM 

Although  machine  learning  has  been  used  in  many 
applications,  very  little  effort  has  been  made  to  combine 
several  learning  techniques  together.  Learning  methodologies 
are  used  independently  to  provide  adaptive  ability  and 
improved  system  performance.  However,  TRIPLE  incor¬ 
porates  explanation-based  learning  and  conceptual  clustering 
into  a  multi-strategy  learning  approach  to  target  recognition. 
By  utilizing  the  capabilities  of  each  learning  method  at 
appropriate  steps  in  the  recognition  and  learning  process, 
TRIPLE  overcomes  the  inherent  limitations  present  in  these 
learning techniques. EBL’s main limitation is the matching
time required when the number of target models becomes
large. SCC has problems with model biases when the
number of object class examples is small. Combining the
ability of EBL to characterize a target using a single training
example with SCC’s efficient method of organizing objects
once they have been properly modeled yields an integrated
learning system which handles the target recognition task.

Figure 1: The TRIPLE target recognition system

Figure  1  shows  the  various  components  of  the  TRIPLE 
target  recognition  system.  Basically,  the  characterization 
abilities  of  explanation-based  learning  are  combined  with  the 
aggregation  capabilities  of  the  conceptual  clustering  tech¬ 
nique.  The  EBL  component  is  used  to  create  and  refine  tar¬ 
get  models  while  the  SCC  component  is  used  to  structure  the 
EBL-generated  models  into  an  efficient  classification  tree. 
The  domain  rules  and  facts  relevant  to  the  recognition  prob¬ 
lem  are  stored  in  the  background  knowledge  database.  The 
goal  and  sub-goal  information  is  stored  in  a  modified  version 
of  a  Goal  Dependency  Network  (GDN)  originally  used  by 
conceptual  clustering.  The  GDN  has  been  adapted  so  that 
the  EBL  component  can  access  the  necessary  information. 


The  TRIPLE  system,  as  indicated  in  Figure  1,  is  composed  of 
six  main  components: 

(1)  A  system  training  set  which  initializes  the  target 
recognition  process. 

(2)  A  target  database,  factual  knowledge,  and  a  goal 
dependency  network.  These  items  are  combined  to 
form  the  background  knowledge  pertaining  to  the 
selected  recognition  domain. 

(3)  An  explanation-based  learning  component  which 
provides  generalized  target  descriptions. 

(4)  A  structured  conceptual  clustering  system  which 
arranges  the  current  set  of  target  descriptions  into  a 
classification  tree. 

(5)  A  model  matching  component  which  uses  the  exist¬ 
ing  classification  tree  to  recognize  an  unknown  tar¬ 
get  description. 

(6)  A  segmentation  and  feature  extraction  component 
which  processes  the  images  and  provides  the  neces¬ 
sary  object  features  and  relationships. 

Each  of  these  elements  will  be  described  in  detail  in  the  fol¬ 
lowing  sub-sections,  followed  by  a  description  of  the 
recognition-learning  cycle  used  by  the  TRIPLE  system. 

3.1  System  Training  Set 

The  initial  input  to  the  TRIPLE  system  consists  of  a  set 
of  target  examples  and  a  collection  of  background  knowledge 
pertinent  to  those  targets.  For  each  target  which  will  be 
recognized  by  TRIPLE,  the  user  must  supply  a  set  of  1  or 
more  representative  examples  for  that  target  class.  Each  set 
of  examples  will  be  analyzed  by  the  EBL  system  and  a  gen¬ 
eralized  target  description  will  be  created.  Note  that  the 
input  data  for  each  target  class  need  only  consist  of  a  single 
example since EBL is capable of producing a correct generalization
from only one target. However, as will be discussed
in section 3.3, EBL can also build a single conceptual
description from more than one example.

In  addition  to  the  examples  for  each  target  which  will 
be  recognized,  the  user  must  also  provide  the  initial  descrip¬ 
tion  of  the  goal  dependency  network  and  any  factual 
knowledge  which  may  be  necessary  when  the  learning  sys¬ 
tems  attempt  to  make  inferences  on  the  target  models.  The 
GDN contains the goal hierarchy necessary for the recognition
and learning systems to derive the applicable target attributes
when recognizing new targets, refining existing models,
or  processing  incomplete  data. 

3.2  Target  Models,  Factual  Database,  and  Goal  Depen¬ 
dency  Network 

The  collection  of  background  knowledge  used  by  the 
TRIPLE  system  is  composed  of  three  main  groups  of  data: 
target  models,  factual  data,  and  a  goal  dependency  network. 
The  target  models,  which  are  originally  built  by  the  EBL 
component,  are  stored  separately  in  the  target  model  database 
in  case  they  are  needed  for  refinement  purposes  later. 
Although  the  target  model  information  will  also  be 
represented  in  the  classification  tree,  the  actual  schema  which 
defines  each  model  is  kept  in  the  database  for  use  by  the 
EBL  system  when  characterizing  a  target  class.  The  features 
and  relationships  which  are  labeled  as  relevant  by  the  EBL 
component  are  tagged  in  the  schema  and  will  be  used  in  the 
SCC  system.  The  models  are  dynamically  modified  by  the 
recognition  and  learning  elements  when  missing  or  incorrect 
attributes  and  relationships  are  discovered. 


The  factual  knowledge  originally  given  by  the  user  is 
maintained in the fact database. This knowledge consists of
a  set  of  rules  and  facts  which  define  the  domain  in  which  the 
recognition  process  is  operating.  For  example,  if  the  recogni¬ 
tion  domain  is  automobiles,  the  factual  knowledge  may  con¬ 
tain  information  regarding  plausible  locations,  orientations, 
uses, and other contextual data for different types of vehicles
on  the  road.  This  information  is  used  in  conjunction  with  the 
learned  target  models  by  the  EBL,  SCC,  and  recognition 
components.  As  the  various  components  attempt  to  apply 
this  information,  inconsistencies  or  gaps  may  be  found. 
These problems will trigger the assertion of new facts and
the  retraction  of  incorrect  or  useless  facts.  Thus,  the  factual 
database  will  be  as  dynamic  as  the  models  which  depend  on 
this  data. 

The  modified  version  of  the  GDN  in  the  TRIPLE  sys¬ 
tem  is  primarily  used  by  the  EBL  component  when  determin¬ 
ing  attribute  relevancy.  Although  it  was  originally  designed 
to  be  used  by  the  SCC  system,  the  GDN  information  is  used 
in  the  EBL  phase  of  the  recognition  loop.  The  version  of  the 
GDN  used  in  the  EBL  phase  contains  two  high  level  goals: 
creation  of  a  new  target  model  and  refinement  of  an  existing 
target  model.  The  selection  of  the  appropriate  goal  is 
described  in  the  next  section.  Using  the  goal  which  is 
received,  the  GDN  accesses  the  goal-subgoal  hierarchy  and 
selects  the  attributes  and  relationships  which  are  useful  in 
characterizing  the  particular  target.  This  alternative  applica¬ 
tion  of  the  GDN  information  is  possible  since  the  main 
responsibility  of  the  GDN  is  to  provide  the  SCC  component 
with  a  set  of  relevant  target  features;  this  task  is  now  accom¬ 
plished  by  the  EBL  system. 

3.3  Explanation-Based  Learning 

The  explanation-based  learning  component  of  the  TRI¬ 
PLE  system  is  responsible  for  two  critical  functions: 

(a)  Processing  the  training  examples  provided  by  the 
user. 

(b)  Explaining  target  recognition  failures  to  improve 
system  performance. 

Each  of  these  tasks  will  now  be  described  in  detail. 

(a)  Processing  training  examples: 

The  first  step  in  the  target  recognition  process  is  to 
acquire  and  represent  a  set  of  target  models.  This  job  has 
traditionally  been  very  difficult  due  to  the  amount  of  work 
necessary  to  generate  a  correct  model.  The  TRIPLE  system 
uses  the  power  of  EBL  to  simplify  the  modeling  process. 
Since  EBL  can  generalize  a  target  description  from  a  single 
example,  the  user  merely  provides  a  set  of  target  attributes 
and  relationships  in  the  training  set.  From  this  data,  EBL 
selects  the  relevant  attributes  and  relationships  using  the 
modified  version  of  the  GDN.  Other  applicable  background 
knowledge  which  provides  useful  data  transformations  is  used 
here  as  well.  These  transformations  are  used  to  infer  high- 
level  target  attributes  from  the  various  input  attributes. 

The  schema  which  is  used  to  store  this  new  target  will 
still  retain  the  entire  list  of  attributes  and  relationships  which 
were  originally  provided  to  the  EBL  system.  The  relevant 
attributes  which  have  been  selected  will  be  tagged  in  order  to 
separate  them  from  the  rest  of  the  attributes.  The  remaining 
attributes  are  retained  in  case  model  refinement  is  ever 
needed.  Some  attributes  will  be  indicated  in  the  GDN  as 
statistical  attributes  that  the  recognition  component  uses  to 
provide  another  adaptive  capability  to  the  recognition  system. 
If  any  of  the  attributes  are  statistical,  they  will  be  initialized 
at this time. This process will be more fully described in
section  3.5. 


Since  all  training  examples  are  processed  sequentially 
before  the  target  recognition  process  begins,  the  training 
phase also ensures that the target generalizations are not
identical.  This  situation  can  result  from  a  collection  of  back¬ 
ground  knowledge  which  is  not  diverse  enough  to  distinguish 
the  various  training  examples.  Alternatively,  the  set  of  target 
attributes  selected  by  the  user  may  not  be  broad  enough  to 
individually  characterize  each  target  class.  In  either  case,  the 
EBL  system  performs  a  simple  comparison  against  all  exist¬ 
ing  schemata  during  the  training  phase.  If  an  identical  match 
occurs,  the  user  will  be  notified  by  the  TRIPLE  system  and 
must  then  supply  the  additional  data  in  order  to  separate 
those  target  classes.  After  the  training  phase,  this  problem 
will  not  occur  again  because  the  unknown  targets  are  first 
checked  against  all  existing  models  before  the  learning  loop 
is  ever  entered.  Thus,  it  is  impossible  for  EBL  to  create  the 
same  generalization  twice,  except  during  this  stage. 
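The comparison against existing schemata might look like the following sketch. The schema layout (a dictionary with `name`, `attributes`, and `relevant` entries) is an assumption made for illustration, as are the function names; two generalizations are treated as identical when their tagged attributes carry the same values.

```python
def relevant_view(schema):
    """Project a schema onto the attributes tagged as relevant."""
    return {name: schema["attributes"][name] for name in schema["relevant"]}


def find_duplicates(new_schema, existing_schemata):
    """Names of existing target models whose generalization matches the new one."""
    view = relevant_view(new_schema)
    return [s["name"] for s in existing_schemata if relevant_view(s) == view]


# Hypothetical schemata produced during the training phase.
sedan = {
    "name": "sedan",
    "attributes": {"wheels": 4, "doors": 4, "color": "red"},
    "relevant": {"wheels"},
}
coupe = {
    "name": "coupe",
    "attributes": {"wheels": 4, "doors": 2, "color": "blue"},
    "relevant": {"wheels"},
}
clashes = find_duplicates(coupe, [sedan])

# After the user supplies a distinguishing attribute, the clash disappears.
sedan["relevant"].add("doors")
coupe["relevant"].add("doors")
resolved = find_duplicates(coupe, [sedan])
```

A non-empty result is the situation in which the user would be asked for additional data to separate the target classes.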


An  additional  extension  to  the  standard  EBL  technique 
is  the  capability  to  generate  a  single  explanation  from  more 
than  one  representative  example.  This  ability  is  useful  when 
the  user  is  trying  to  characterize  a  target  model  that  may  not 
be sufficiently captured in a single training example, such as
when  modeling  a  3D  target.  Using  only  a  single  example  in 
this  situation  could  bias  the  resulting  generalization  towards 
that  particular  example.  In  order  to  prevent  the  possibility  of 
this  bias,  the  EBL  component  of  the  TRIPLE  system  allows 
the  user  to  create  a  single  target  model  with  up  to  3  examples 
of  that  target.  There  are  two  ways  to  accomplish  this  goal. 
First,  each  of  the  examples  can  be  individually  generalized 
and  these  generalizations  can  be  intersected  to  provide  a  sin¬ 
gle  model.  Alternatively,  the  similarities  of  each  example 
can  be  located  and  used  to  provide  a  generalization.  The  first 
method  is  preferable  since  it  is  easier  to  intersect  the  general¬ 
izations  due  to  the  smaller  number  of  remaining  attributes 
and  relationships.  Multiple  examples  will  hopefully  eliminate 
any  biases  which  may  be  implied  from  using  only  a  single 
example. 
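The first method, intersecting individually generalized examples, can be sketched under one possible reading of "intersection": keep only the attribute-value pairs on which every generalization agrees. The view names and attributes below are hypothetical, and the limit of three examples follows the text.

```python
_MISSING = object()  # sentinel: attribute absent from a generalization


def intersect_generalizations(generalizations):
    """Keep the attribute-value pairs shared by every individual generalization."""
    if not 1 <= len(generalizations) <= 3:
        raise ValueError("between one and three examples are expected")
    first, *rest = generalizations
    return {
        attribute: value
        for attribute, value in first.items()
        if all(g.get(attribute, _MISSING) == value for g in rest)
    }


# Hypothetical generalizations of two views of the same 3D target.
top_view = {"wing_count": 2, "engine_count": 4, "tail": "single_fin", "outline": "cross"}
side_view = {"engine_count": 4, "tail": "single_fin", "outline": "cigar"}
model = intersect_generalizations([top_view, side_view])
```

View-specific details such as `outline` drop out, which is the intended effect of removing the bias toward any single example.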


Using  the  above  modifications  to  the  standard  EBL  tech¬ 
nique,  the  target  model  training  task  has  been  effectively 
automated  in  the  TRIPLE  system.  After  characterizing  each 
of  the  targets,  a  copy  of  the  schema  will  be  stored  in  the 
background  knowledge  database  and  another  copy  will  be 
sent  to  the  SCC  component  for  integration  into  the 
classification  tree. 


(b)  Explaining  recognition  failures: 

The  main  task  of  the  EBL  component  is  explaining  the 
various  types  of  recognition  failures  that  occur.  EBL  must 
be  provided  with  two  pieces  of  information  in  order  to  prop¬ 
erly  explain  these  failures.  First,  the  system  must  have  the 
target  description  which  caused  the  recognition  failure. 
Second,  the  type  of  failure  which  resulted  from  this  target 
must also be determined. The failure type serves as the goal
concept in the generalization process. Most EBL systems
require that the user provide the goal concept from
which  the  generalization  can  be  derived.  However,  the  TRI¬ 
PLE  system  does  not  require  the  user  to  provide  this  kind  of 
information.  Instead,  the  model  matching  component  deter¬ 
mines  the  appropriate  goal  concept  and  sends  it  to  the  EBL 
system.  The  goal  concepts  are  generated  when  the  recogni¬ 
tion  system  is  not  able  to  properly  recognize  a  new  target. 
TRIPLE’s  EBL  component  is  designed  to  process  two  main 
types  of  recognition  failures: 

(1)  Failures  caused  by  the  presence  of  a  new  target. 


(2)  Failures  due  to  an  incomplete  or  incorrect  target 
model. 


Failures  which  result  from  encountering  a  new  model  are 
handled  in  the  same  manner  as  when  processing  the  user- 
supplied  training  examples.  This  process  was  described  pre¬ 
viously  in  this  section. 


If  an  incorrect  model  causes  the  recognition  failure, 
EBL  can  explain  this  event  and  alter  the  model  to  avoid  the 
problem  in  the  future.  There  are  several  types  of  target 
model  errors  which  EBL  can  effectively  handle.  If  the 
current  model  contains  an  attribute  or  relationship  which  is 
not relevant, EBL will remove the "relevant" tag from that
attribute or relationship in the model schema. If the
model  contains  an  attribute  whose  value  is  not  useful,  EBL 
can  modify  the  value.  Finally,  if  a  target  model  is  not  using 
an  attribute  or  relationship  which  is  relevant,  it  can  be  tagged 
as  relevant  in  the  model  schema.  Since  these  types  of  recog¬ 
nition  failures  are  located  and  labeled  by  the  model  matching 
component,  EBL  can  minimize  the  search  space  when  trying 
to  locate  the  incorrect  data. 
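The three refinement operations can be sketched against a simple schema that stores every attribute and a separate set of "relevant" tags. The class `TargetSchema`, the failure labels, and the dictionary format of a failure report are assumptions for illustration; the text does not specify how the model matching component encodes a failure.

```python
from dataclasses import dataclass, field


@dataclass
class TargetSchema:
    name: str
    attributes: dict                             # every attribute supplied in training
    relevant: set = field(default_factory=set)   # names tagged as relevant


def refine(schema, failure):
    """Apply one refinement, driven by the failure label from model matching."""
    kind, attribute = failure["kind"], failure["attribute"]
    if attribute not in schema.attributes:
        raise KeyError(attribute)
    if kind == "irrelevant_attribute":
        schema.relevant.discard(attribute)       # remove the "relevant" tag
    elif kind == "bad_value":
        schema.attributes[attribute] = failure["value"]
    elif kind == "missing_attribute":
        schema.relevant.add(attribute)           # tag a retained attribute
    else:
        raise ValueError(f"unknown failure kind: {kind}")
    return schema


truck = TargetSchema(
    name="truck",
    attributes={"wheels": 4, "cargo_bed": "yes", "color": "red"},
    relevant={"wheels", "color"},
)
refine(truck, {"kind": "irrelevant_attribute", "attribute": "color"})
refine(truck, {"kind": "bad_value", "attribute": "wheels", "value": 6})
refine(truck, {"kind": "missing_attribute", "attribute": "cargo_bed"})
```

Because the failure report names both the kind of error and the attribute involved, no search over the schema is needed, which mirrors the reduced search space described above.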


3.4  Structured  Conceptual  Clustering 

Normally,  conceptual  clustering  applications  classify 
hundreds  of  examples  simultaneously,  many  of  which  belong 
to  the  same  class,  thus  requiring  the  use  of  a  similarity-based 
method  to  correctly  characterize  the  targets  present  in  a  given 
cluster.  The  similarity-based  method  performs  the  characteri¬ 
zation  task  defined  in  Section  4.  However,  in  the  TRIPLE 
system,  SCC  receives  only  one  pre-characterized  example 
which  represents  a  target  class  from  the  EBL  system.  The 
EBL  component  has  already  performed  the  task  usually  han¬ 
dled  by  the  GDN;  all  relevant  target  attributes  and  relation¬ 
ships  have  been  identified.  Since  SCC  relies  on  conceptual 
simplicity  as  a  quality  measure,  instead  of  numeric  quality 
values  which  depend  on  the  number  of  samples  in  a  cluster, 
it  can  produce  a  valid  clustering  which  contains  only  one 
sample  per  cluster. 

The  measure  of  simplicity  determines  how  precisely  the 
classification  tree  distinguishes  various  targets.  For  example, 
assume  the  target  recognition  domain  is  automobiles.  If  the 
measure  of  simplicity  is  very  coarse,  it  may  be  impossible  for 
the  system  to  separate  different  types  of  cars,  although  cars 
and  trucks  are  distinguishable.  As  the  measure  of  simplicity 
becomes  more  complex,  different  types  of  cars  can  be 
identified  (sedan,  coupe,  etc.). 

To  provide  greater  efficiency  in  the  conceptual  cluster¬ 
ing  process,  SCC  does  not  always  create  a  new  classification 
tree.  If  a  model  refinement  operation  has  been  selected,  only 
a  portion  of  the  classification  tree  has  to  be  changed.  By 
tracing  the  classification  tree  until  the  branch  at  which  the 
incorrect  feature  is  located,  SCC  can  merely  re-cluster  the 
rest  of  that  branch.  Since  target  models  will  be  isolated 
along  only  one  branch  of  the  tree,  there  is  no  danger  in  rear¬ 
ranging  individual  branches  of  the  tree. 

If  a  new  model  is  added  to  the  classification  tree,  the 
tree  will  have  to  be  reconstructed  since  new  attributes  may  be 
present  in  the  new  model  which  are  not  present  in  the  tree. 
In  addition,  the  new  model  may  interact  with  the  existing  tar¬ 
get  models  in  such  a  way  that  features  which  were  not  previ¬ 
ously  useful  as  important  classifiers  can  now  be  used  to  dis¬ 
tinguish  target  classes.  Thus,  the  conceptual  clustering  pro¬ 
cess  will  be  applied  to  the  entire  group  of  target  models, 
including  the  new  target  which  has  been  created.  The 
classification  tree  can  then  be  passed  to  the  model  matching 
component  for  use  in  identifying  unknown  targets. 

3.5  Knowledge-Based  Model  Matching 

The  model  matching  component  of  the  TRIPLE  system 
uses  the  classification  tree  produced  by  SCC  in  order  to  iden¬ 
tify  unknown  targets.  The  model  matching  component  has 
access  to  the  knowledge  base  so  that  any  necessary  data 
transformations  or  inferences  can  be  made.  This  component 
has  two  functions  within  the  TRIPLE  system:  recognition  of 
targets  which  already  exist  in  the  model  database  and 
identification  of  the  two  types  of  recognition  failures  which 
cause  the  system  to  enter  the  learning  loop. 

When  an  unknown  target  is  received  by  the  recognition 
system,  it  is  first  checked  against  all  items  currently  in  the 
model  database.  This  matching  procedure  is  done  by  using 
the  classification  tree.  Since  the  model  schemata  are 
represented  in  a  search  tree  format,  the  recognition  process  is 
far  more  efficient  than  comparing  the  new  target  with  each 
model  individually.  Because  some  of  the  features  present  in 
a  target  model  may  be  missing  in  the  image  observations  due 
to  segmentation  problems  or  occlusion,  the  model  matching 
component  monitors  the  parsing  of  the  classification  tree  to 
decide  on  the  type  of  recognition  or  failure  which  has 
occurred.  Each  of  these  possibilities  will  now  be  briefly  dis¬ 
cussed. 

Complete  Matching:  If  the  model  matching  component  can 
successfully  traverse  the  classification  tree,  match  all  neces¬ 
sary  attribute  and  relationship  criteria,  and  arrive  at  a  leaf 
node  which  contains  a  target  model,  that  model  has  been 
correctly  matched.  During  the  parsing  of  this  tree,  the  recog¬ 
nition  element  has  some  flexibility  in  terms  of  slight  varia¬ 
tions  in  attribute  values  and  a  small  number  of  missing  attri¬ 
butes.  However,  if  any  attributes  are  missing,  they  must  be 
minor  in  nature;  global  target  features  cannot  be  safely 
ignored.  Confidence  in  a  successful  match  is  computed 
based  on  the  amount  of  variation  in  attribute  values  as  well 
as  the  significance  of  any  missing  attributes. 

Incomplete  Matching:  If  the  model  matching  component  can 
correctly  parse  the  high  level  nodes  of  the  classification  tree 
(these  nodes  specify  global  attributes  of  the  target  models) 
prior  to  exhausting  all  available  image  data,  all  targets 
present  in  any  leaf  nodes  at  the  end  of  that  branch  can  be 
hypothesized  as  valid  models  for  the  data  which  is  present. 
Matching  global  features  and  relationships  implies  that  the 
target  is  lacking  the  details  necessary  for  a  precise 
identification.  However,  this  information  can  guide  the  seg¬ 
mentation  and  feature  extraction  algorithm  which  can  be 
prompted  to  provide  additional  information  on  the  target 
being  identified.  If  no  more  feature  information  can  be 
obtained,  the  model  matching  component  provides  the  com¬ 
plete  set  of  possible  target  identifications. 

Occlusion:  If  the  recognition  process  is  not  able  to  match 
any  high  level  classification  attributes,  but  can  recognize 
many  low  level  attributes  and  relationships,  the  presence  of 
occlusion  is  very  probable.  In  this  case,  a  confidence  level 
can  be  computed  based  on  the  reliability  of  the  features 
which  lead  to  the  matching.  For  example,  if  the  global 
features  of  a  car  (length,  height,  etc.)  cannot  be  matched,  but 
the  shapes  of  the  hood,  bumper,  headlights,  and  fenders  are 
present,  the  recognition  system  can  use  this  information  to 
identify  the  target  as  a  car.  A  confidence  threshold  is  used  to 
ensure  that  the  model  matching  component  does  not  match 
ridiculously  simple  features  and  report  a  valid  matching. 

Model  Refinement:  Model  refinement  is  determined  by  a 
process  which  is  very  similar  to  complete  recognition.  The 
parsing  of  the  classification  tree  should  proceed  without  miss¬ 
ing  more  than  a  few  target  attributes  or  relationships. 
However,  the  failure  is  encountered  when  comparing  the 
values  stored  in  the  attributes  themselves.  While  the  attribute 
itself  may  be  present  in  the  target  model,  the  values  in  the 
model  attributes  may  be  different  from  the  values  in  the 
image  attributes.  This  situation  implies  that  the  target  model 
contains  most  of  the  correct  attributes  but  needs  to  be  refined 
by  updating  the  values  present  in  the  attribute  variables.  In 
addition,  any  missing  or  extra  attributes  present  in  the  target 
model  can  be  inspected  during  the  refinement  process  to 
determine  whether  they  should  be  included  or  removed, 
respectively. 

New  Target  Model:  This  situation  is  encountered  when  the 
model  matching  component  cannot  apply  any  meaningful 
portions  of  the  classification  tree  to  the  data  received  from 
the  image.  The  image  data  is  passed  to  the  learning  loop  of 
the  TRIPLE  system  for  characterization  and  integration  into 
the  classification  tree.  If  the  data  contains  enough  high  level 
and  low  level  details  to  properly  characterize  a  new  target 
model,  it  will  be  processed  by  the  EBL  system  and  incor¬ 
porated  into  the  classification  tree.  Otherwise,  the  EBL  sys¬ 
tem  will  report  the  failure  to  understand  the  data  and  the 
object  will  be  classified  as  unknown. 
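
The five outcomes above can be read as a decision procedure over a few summary quantities of the tree traversal. The following is a minimal sketch under assumed inputs: the fractions of high level and low level attributes matched, whether a leaf was reached, and whether the attribute values disagreed. All names and the threshold value are hypothetical and are not taken from the TRIPLE implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMPLETE = "complete matching"
    INCOMPLETE = "incomplete matching"
    OCCLUSION = "occlusion"
    REFINEMENT = "model refinement"
    NEW_MODEL = "new target model"


@dataclass
class ParseSummary:
    # All fields are hypothetical summaries of one traversal of the tree.
    high_level_matched: float  # fraction of global attributes matched, 0..1
    low_level_matched: float   # fraction of detail attributes matched, 0..1
    reached_leaf: bool         # traversal ended at a leaf holding a model
    values_disagree: bool      # attributes present, but values out of tolerance


def classify_outcome(s: ParseSummary, threshold: float = 0.7) -> Outcome:
    """Map a traversal summary to one of the five cases in the text.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    if s.high_level_matched >= threshold:
        if s.reached_leaf and s.values_disagree:
            return Outcome.REFINEMENT   # right attributes, wrong values
        if s.reached_leaf:
            return Outcome.COMPLETE     # leaf reached, values agree
        return Outcome.INCOMPLETE       # global match, details exhausted
    if s.low_level_matched >= threshold:
        return Outcome.OCCLUSION        # details match, global features do not
    return Outcome.NEW_MODEL            # no meaningful part of the tree applies
```

In this sketch only the refinement and new-model outcomes would send data to the learning loop; the other three are reported to the user with a confidence value.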


The  recognition  element  updates  the  statistical  variables 
for  the  given  target  model  if  and  when  that  model  is  used  to 
recognize  a  target.  This  process  allows  the  system  to  gradu¬ 
ally  determine  that  certain  features  have  a  higher  utility  since 
they  are  always  used  in  the  recognition  process.  In  addition, 
the  variable  values  of  certain  attributes  can  be  slowly 
modified  by  the  recognition  process  in  order  to  overcome  any 
initial  bias  that  may  have  been  derived  from  the  initial  con¬ 
struction  of  the  target  model.  This  factor  is  especially  useful 
for  those  models  which  are  added  by  the  learning  system  dur¬ 
ing  normal  operations.  While  the  user  is  responsible  for  pro¬ 
viding  examples  which  are  highly  representative  of  a  target 
class  during  the  training  phase,  target  models  acquired  during 
the  recognition  phase  of  the  system  may  be  subject  to  a  much 
higher  level  of  noise.  Thus,  the  model  matching  component 
can  slowly  adapt  these  models  as  more  targets  of  this  type 
are  recognized  and  the  system  determines  a  better  value  for 
many  of  the  attributes  which  characterize  that  target. 
Changes  in  attribute  values  made  by  the  model  matching 
component  will  be  very  gradual  compared  with  the  changes 
which  result  from  activating  the  model  refinement  process 
during  the  learning  process  of  the  TRIPLE  system. 
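
One simple way to picture this gradual adaptation is a small-step running update of each stored attribute value, together with a usage count per feature. The sketch below assumes an exponential moving average with a small rate. The text does not say which update rule TRIPLE uses, so the rule, the function names, and the default rate are all assumptions.

```python
def adapt_attribute(model_value: float, observed_value: float,
                    rate: float = 0.05) -> float:
    """Move a stored attribute value a small step toward a new observation.

    A small rate keeps the change gradual, in contrast to an explicit
    model refinement, which would overwrite the value.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    return model_value + rate * (observed_value - model_value)


def update_utility(use_counts: dict, features_used: list) -> dict:
    """Count how often each feature contributed to a successful recognition."""
    updated = dict(use_counts)
    for name in features_used:
        updated[name] = updated.get(name, 0) + 1
    return updated
```

Features with high counts would be candidates for a higher utility rating, while the running update slowly removes bias left over from a noisy first example.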

3.6  Segmentation  and  Feature  Extraction 

The  general  purpose  segmentation  and  feature  extraction 
component  of  the  TRIPLE  system  extracts  the  necessary  low 
level  features  and  relationships  from  the  image  data  which 
are  then  used  by  the  knowledge-based  model  matching 
component  in  the  target  recognition  process.  The  feature 
extraction  process  is  designed  to  locate  the  most  prominent 
features  in  the  image  data  and  to  determine  the  significant 
relationships  between  these  features.  The  features  of  interest 
in  the  image  include  regions,  edges,  corners,  arcs,  and 
ellipses. 

Once  the  primitive  features  have  been  located,  the 
feature  extraction  process  then  considers  the  spatial  relation¬ 
ships  between  various  features.  If  a  predefined  orientation  of 
a  set  of  features  is  located,  the  feature  extraction  process 
hypothesizes  the  presence  of  a  high  level  symbolic  feature. 
For  example,  in  the  domain  of  automobile  recognition,  if  a 
pair  of  concentric  ellipses  is  found  in  the  image,  the  feature 
extraction  component  will  hypothesize  the  presence  of  a  tire 
and  wheel.  This  symbolic  feature  can  then  be  used  in  con¬ 
junction  with  other  symbolic  features  in  the  model  matching 
process  described  in  the  previous  section. 
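
The tire-and-wheel example can be sketched as a test on pairs of detected ellipses. The code below is an illustration only: the `Ellipse` fields, the centre tolerance, and the nesting test are assumptions, since the text does not specify how concentricity is measured.

```python
import math
from dataclasses import dataclass


@dataclass
class Ellipse:
    cx: float     # centre x (pixels)
    cy: float     # centre y (pixels)
    major: float  # semi-major axis length (pixels)
    minor: float  # semi-minor axis length (pixels)


def is_concentric_pair(a: Ellipse, b: Ellipse, centre_tol: float = 2.0) -> bool:
    """True when the centres nearly coincide and one ellipse is smaller in both axes."""
    centre_distance = math.hypot(a.cx - b.cx, a.cy - b.cy)
    inner, outer = sorted((a, b), key=lambda e: e.major)
    strictly_nested = inner.major < outer.major and inner.minor < outer.minor
    return centre_distance <= centre_tol and strictly_nested


def hypothesize_wheels(ellipses: list) -> list:
    """Return one 'tire and wheel' hypothesis per concentric pair of ellipses."""
    hypotheses = []
    for i, a in enumerate(ellipses):
        for b in ellipses[i + 1:]:
            if is_concentric_pair(a, b):
                inner, outer = sorted((a, b), key=lambda e: e.major)
                hypotheses.append(
                    {"symbol": "tire_and_wheel", "outer": outer, "inner": inner}
                )
    return hypotheses
```

Each returned dictionary stands for one symbolic feature that could be passed on to model matching alongside the primitive features.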

In  addition  to  providing  initial  feature  and  relational 
information,  which  is  a  data  driven  process,  the  feature 
extraction  component  can  also  be  used  in  a  model  driven 
manner  by  the  model  matching  component.  During  the  parsing 
of  the  classification  tree,  the  model  matching  procedure 
may  encounter  the  presence  of  a  symbolic  feature  in  the  tree 
which  is  not  present  in  the  initial  symbolic  feature  list  pro¬ 
vided  by  the  feature  extraction  component.  The  model 
matching  process  can  request  the  feature  extraction  procedure 
to  reanalyze  a  specific  portion  of  the  image  in  search  of  that 
particular  feature.  The  parameter  sets  and  thresholds  used  in 
this  instance  can  be  more  relaxed  than  during  the  initial 
image  processing  since  only  a  portion  of  the  image  is  being 
processed  and  the  presence  of  a  specific  feature  is  being 
sought.  If,  after  several  attempts  to  find  the  feature,  the 
extraction  process  is  not  successful,  the  feature  extraction 
component  will  abandon  the  search  and  the  model  matching 
process  will  be  informed  of  the  failure.  Otherwise,  the  pres¬ 
ence  and  location  of  the  desired  feature  will  be  returned  to 
the  model  matching  component. 
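
The model driven request described above amounts to a bounded retry loop with progressively relaxed thresholds. A minimal sketch follows, assuming a detector callback that takes an image region and a threshold and returns a location or nothing. The callback signature, the region format, and the numeric defaults are hypothetical.

```python
from typing import Callable, Optional, Tuple

Region = Tuple[int, int, int, int]  # (x, y, width, height), hypothetical format
Location = Tuple[int, int]
Detector = Callable[[Region, float], Optional[Location]]


def search_for_feature(detect: Detector, region: Region,
                       initial_threshold: float = 0.8,
                       relax_step: float = 0.1,
                       max_attempts: int = 3) -> Optional[Location]:
    """Re-run a detector on one image region, relaxing its threshold each attempt.

    Returns the feature location, or None so that the caller (the model
    matching process) is informed of the failure.
    """
    threshold = initial_threshold
    for _ in range(max_attempts):
        location = detect(region, threshold)
        if location is not None:
            return location
        threshold = max(0.0, threshold - relax_step)
    return None
```

Restricting the search to a single region is what makes the relaxed thresholds affordable: false positives elsewhere in the image are never examined.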

3.7  Recognition-Learning  Process 

The  vision-learning  cycle  of  the  TRIPLE  system  con¬ 
sists  of  three  main  steps:  model  creation/refinement,  target 
aggregation,  and  target  recognition.  A  brief  review  of  each 
of  these  stages  as  well  as  how  they  interact  in  the  target 
recognition  process  is  provided  in  the  rest  of  this  section. 

The  user  provides  an  initial  collection  of  background 
knowledge  and  a  set  of  training  examples  which  represent  the 
targets  needed  for  recognition  purposes.  The  domain  back¬ 
ground  knowledge  is  stored  for  use  by  all  components  of  the 
TRIPLE  system.  The  sets  of  target  examples  are  given  to  the 
EBL  system  for  characterization.  During  the  training  phase, 
the  EBL  component  must  ensure  that  the  target  models  which 
are  created  are  unique.  If  they  fail  to  meet  this  requirement, 
the  user  should  be  notified  and  must  then  provide  additional 
background  knowledge  or  more  detailed  target  examples. 
The  model  schemata  are  stored  in  the  knowledge  base  for 
subsequent  use.  In  addition,  a  copy  of  each  schema  is  then 
sent  to  SCC  for  inclusion  in  the  new  system  classification 
tree.  Once  SCC  has  received  all  the  models  from  the  train¬ 
ing  set,  it  creates  the  initial  classification  tree.  Finally,  the 
tree  is  sent  to  the  model  matching  component  for  recognition 
of  unknown  targets. 

After  the  initial  training  has  been  completed,  the  recog¬ 
nition  component  uses  the  classification  tree  to  recognize 
each  target  processed  by  the  TRIPLE  system.  If  a  target  has 
already  been  modeled,  it  will  be  recognized,  the  confidence 
will  be  computed,  and  this  information  will  be  output  to  the 
user.  The  types  of  recognition  produced  by  the  TRIPLE  sys¬ 
tem  include  complete  recognition,  incomplete  recognition, 
and  recognition  in  occluded  scenes. 

If  the  model  exists,  but  needs  to  be  refined,  the  recogni¬ 
tion  component  will  send  the  image  data  to  the  EBL  system 
which  will  identify  the  incorrect  data,  update  the  model,  and 
send  this  information  to  the  SCC  system.  SCC  will  then 
make  the  necessary  modifications  to  the  classification  tree 
and  return  the  updated  tree  back  to  the  recognition  element, 
where  this  process  begins  again. 

If  the  model  does  not  exist,  the  image  data  is  sent  to  the 
EBL  system  for  construction  of  a  new  model.  This  pro¬ 
cedure  is  subject  to  verification  of  the  usefulness  of  the 
observed  image  data.  Once  the  EBL  system  has  built  the 
new  model  schema,  it  is  passed  to  the  SCC  component.  In 
this  case,  the  entire  classification  tree  is  rebuilt,  in  case  the 
new  model  contains  attributes  which  may  lead  to  a  more 
efficient  tree  representation.  As  before,  the  tree  is  then  sent 
to  the  model  matching  component  for  use  in  subsequent 
identification  activities. 


4.  EXAMPLE  -  VEHICLE  RECOGNITION 

In  order  to  fully  illustrate  the  interactions  among  the 
components  of  the  TRIPLE  system,  this  section  provides  a 
simple  example  using  military  vehicles  as  objects.  The  vehi¬ 
cles  which  have  been  chosen  are  examples  of  Soviet  vehicles 
which  would  likely  be  encountered  by  an  autonomous  vehicle 
during  a  reconnaissance  or  surveillance  mission.  This  group 
includes  armoured  personnel  carriers,  reconnaissance  vehi¬ 
cles,  and  cargo  transport  vehicles.  Complex  objects  such  as 
tanks  and  self-propelled  guns  have  been  avoided  to  simplify 
this  example.  In  addition,  the  vehicles  which  have  been 
selected  will  be  treated  as  2D  objects  in  terms  of  the  features 
which  are  used  to  classify  them.  However,  in  practice,  the 
TRIPLE  system  is  being  implemented  using  a  3D  object 
modeling  system  which  incorporates  salient  object  features 
and  relationships  between  those  features. 

Figure  2  shows  the  system  training  set  for  the  selected 
Soviet  vehicles  used  in  this  example.  The  set  includes  an 
armoured  personnel  carrier  (BTR-70),  a  reconnaissance  vehi¬ 
cle  (BRDM-1),  and  a  cargo  truck  (MAZ-502).  The  feature 
set  defined  for  this  example  consists  of  simple  features  such  as 
object  length,  height,  wheelbase,  number  of  wheels,  and  so 
on.  Each  of  these  features  can  be  reliably  extracted  from  an 
image  containing  that  target.  Note  that  not  all  features  are 
defined  on  all  objects;  the  cargo  truck  does  not  have  an  arma¬ 
ment  feature  but  does  include  a  load  height  feature  which 
specifies  the  height  of  the  cargo  platform. 

The  EBL  component  of  the  TRIPLE  system  takes  the 
above  training  set  and  selects  the  relevant  target  features 
using  the  background  knowledge  for  this  particular  target 
domain.  In  this  example,  the  rules  may  contain  information 
such  as: 

    Vehicle  length,  height,  and  wheelbase  are  always  used  in 
    characterizing  a  target. 

    If  a  vehicle  has  armaments,  they  are  useful  in 
    characterizing  that  target. 

The  set  of  relevant  features  for  the  Soviet  vehicles  is  shown 
in  Figure  3.  Since  the  armoured  personnel  carrier  and  the 
reconnaissance  vehicle  have  armaments,  that  feature  has  been 
included  in  the  target  models  in  addition  to  length,  height, 
and  wheelbase.  As  mentioned  in  section  3,  the  remaining 
target  attributes  are  retained  in  the  target  model  database  in 
case  the  system  needs  to  use  them  in  the  future. 
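
The two example rules can be written as a small selection function over a feature dictionary. This is a sketch only: the dictionary keys and the numeric values are placeholders chosen for illustration and are not the values shown in Figure 2.

```python
ALWAYS_RELEVANT = ("length", "height", "wheelbase")


def select_relevant_features(target: dict) -> dict:
    """Apply the two example rules to one target's feature dictionary."""
    # Rule 1: length, height, and wheelbase are always used.
    relevant = {name: target[name] for name in ALWAYS_RELEVANT if name in target}
    # Rule 2: armaments are useful only for vehicles that have them.
    if target.get("armament") is not None:
        relevant["armament"] = target["armament"]
    return relevant


# Placeholder feature lists (hypothetical values, not from the paper's figures).
apc = {"length": 7.5, "height": 2.3, "wheelbase": 4.4,
       "wheels": 8, "armament": "machine gun"}
truck = {"length": 7.2, "height": 2.7, "wheelbase": 4.5,
         "wheels": 4, "load_height": 1.4}
```

Features that are not selected, such as `wheels` here, stay in the full feature list and can be promoted later, which is what happens to the number-of-wheels feature when a new vehicle is learned.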

Once  the  EBL  characterization  for  each  target  has  been 
obtained,  the  SCC  component  creates  the  classification  tree 
which  will  be  used  by  the  knowledge-based  matching  com¬ 
ponent  during  the  subsequent  recognition  procedure.  SCC 
segregates  the  targets  based  on  the  conceptual  simplicity  of 
the  target  classes.  The  classification  tree  for  this  example  is 
indicated  in  Figure  4.  Because  the  cargo  truck  contains  no 
armaments,  it  is  conceptually  different  from  the  other  two 
vehicles  and  is  placed  in  a  separate  branch  of  the 
classification  tree.  In  the  right  branch  of  the  tree,  the  length 
feature  of  those  targets  is  determined  to  be  the  most  useful  in 
distinguishing  between  the  recon  vehicle  and  the  personnel 
carrier.  The  resulting  classification  tree  is  sent  to  the  match¬ 
ing  component  for  use  in  recognizing  these  three  targets. 
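
A tree of this shape can be sketched as nested nodes, each holding one feature test. The structure below follows the description of Figure 4 (an armament test at the root, then a length test). The length threshold and the assumption that the personnel carrier is the longer vehicle are illustrative, as the figure values are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    model: str


@dataclass
class Node:
    test: Callable[[dict], bool]  # True -> right branch, False -> left branch
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def classify(tree, features: dict) -> str:
    """Walk from the root to a leaf, applying one feature test per node."""
    node = tree
    while isinstance(node, Node):
        node = node.right if node.test(features) else node.left
    return node.model


LENGTH_SPLIT = 6.5  # hypothetical threshold; a real one comes from the models

tree = Node(
    test=lambda f: f.get("armament") is not None,
    left=Leaf("cargo truck"),
    right=Node(
        test=lambda f: f["length"] > LENGTH_SPLIT,
        left=Leaf("reconnaissance vehicle"),
        right=Leaf("armoured personnel carrier"),
    ),
)
```

Recognition then costs one test per tree level rather than one comparison per stored model, which is the efficiency argument made in section 3.5.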

The  matching  component  continues  to  use  the 
classification  tree  until  it  encounters  a  target  which  cannot 
be  processed  by  the  tree  information.  This  situation  was 
fully  described  in  section  3.5.  When  this  failure  occurs,  the 
unknown  target  is  sent  to  the  EBL  component  for  complete 
characterization.  This  event  signals  the  beginning  of  the 
learning  cycle.  Figure  5  shows  a  target  which  cannot  be  processed 
by  the  classification  tree  due  to  differences  in  each 
of  the  three  target  features.  The  vehicle  is  assumed  to  be  a 
new  target  and  the  feature  list  is  sent  to  the  EBL  system. 

Figure  2:  System  training  set  for  Soviet  vehicles 

Figure  3:  EBL-selected  relevant  target  features 

Figure  4:  SCC-generated  target  classification  tree 

Figure  6  displays  the  results  of  the  relevant  feature  selection 
by  the  EBL  component.  In  the  process  of  selecting  relevant 
features,  EBL  has  determined  that  the  number  of  wheels  is  a 
useful  feature  when  distinguishing  the  new  vehicle  from  the 
MAZ-520  cargo  truck  which  is  also  present.  This  feature  is 
then  marked  in  the  MAZ-520  feature  list  as  relevant.  The 
resulting  classification  tree  produced  by  the  SCC  component 
is  shown  in  Figure  7.  The  number-of-wheels  feature  is  now 
used  in  the  left  branch  of  the  tree  to  distinguish  between  the 
MAZ-520  cargo  truck  and  the  new  unknown  vehicle. 

This  simple  example  has  shown  how  the  TRIPLE  sys¬ 
tem  can  achieve  target  model  acquisition  and  model 
refinement  by  explaining  feature  significance  and  using  those 
features  to  efficiently  recognize  future  instances  of  a  target. 

Figure  5:  Unknown  vehicle  which  triggers  the  learning  cycle 

Figure  6:  EBL-selected  relevant  target  features 

Figure  7:  New  SCC-generated  target  classification  tree 

5.  CONCLUSIONS  AND  FUTURE  WORK 

We  have  shown  the  usefulness  of  integrating  machine 
learning  with  target  recognition  to  create  a  system  which 
adapts  its  representation  of  the  target  domain  in  order  to 
operate  effectively  in  an  unconstrained,  outdoor  environment. 
Present  target  recognition  systems  do  not  have  this  capability  [17]. 
Using  the  characterization  ability  of  explanation-based 
learning  and  the  efficient  classification  techniques  of  concep¬ 
tual  clustering,  the  TRIPLE  system  provides  automatic 
knowledge-base  acquisition  and  refinement.  The  system  is 
robust  in  several  respects:  it  can  acquire  a  target  model 
using  a  single  example  of  that  target;  it  uses  only  the  most 
relevant  features  during  subsequent  recognition  stages;  and 
target  model  refinements  are  facilitated  through  the  use  of  the 
EBL  component. 

Presently,  we  are  working  on  an  implementation  which 
recognizes  complex,  3D  targets  such  as  cars,  trucks,  vans, 
etc.  This  system  uses  3D  object  models  which  consist  of 
object  features  and  relationships  among  the  features.  In  addi¬ 
tion,  we  will  evaluate  the  performance  of  the  TRIPLE  system 
when  subjected  to  problems  such  as  occlusion,  image  noise, 
etc. 

ABSTRACT 

The  practical  recovery  of  quantitative  structural  informa¬ 
tion  about  the  world  from  visual  data  has  proven  to  be  a  very 
difficult  task.  In  particular,  the  recovery  of  motion  informa¬ 
tion  which  is  sufficiently  accurate  to  allow  practical  application 
of  theoretical  shape  from  motion  results  has  so  far  been 
infeasible.  Yet  a  large  body  of  evidence  suggests  that  use  of 
motion  is  an  extremely  important  process  in  biological  vision 
systems.  It  has  been  suggested  by  Thompson  [Thom86]  that 
qualitative  visual  measurements  can  provide  powerful  percep¬ 
tual  cues,  and  that  practical  operations  can  be  performed  on 
the  basis  of  such  clues  without  the  need  for  a  quantitative 
reconstruction  of  the  world.  The  use  of  such  information  is 
termed  “inexact  vision”.  This  paper  describes  the  investiga¬ 
tion  of  one  such  approach  to  the  analysis  of  visual  motion. 
Specifically,  the  use  of  certain  measures  of  flow  field  divergence 
was  investigated  as  a  qualitative  cue  for  obstacle  avoidance 
during  visual  navigation.  It  is  shown  that  a  quantity  termed 
the  directional  divergence  of  the  2-D  motion  field  can  be  used 
as  a  reliable  indicator  of  the  presence  of  obstacles  in  the  visual 
field  of  an  observer  undergoing  generalized  rotational  and 
translational  motion.  Moreover,  the  necessary  measurements 
can  be  robustly  obtained  from  real  image  sequences.  A  simple 
differential  procedure  for  robustly  extracting  divergence  infor¬ 
mation  from  image  sequences  which  can  be  performed  using  a 
highly  parallel,  connectionist  architecture  is  described.  The 
procedure  is  based  on  the  twin  principles  of  directional  separa¬ 
tion  of  optical  flow  components  and  temporal  accumulation  of 
information.  Experimental  results  are  presented  showing  that 
the  system  responds  as  expected  to  divergence  in  real  world 
image  sequences,  and  the  use  of  the  system  to  navigate 
between  obstacles  is  demonstrated. 


I.  INTRODUCTION 

Navigation  is  a  basic  operation  in  automation  which  must 
be  performed  by  any  system  that  manipulates  or  responds  to 
physical  objects  within  its  environment.  A  robot  assembly 
arm,  for  instance,  must  move  without  accidentally  striking  the 
workpiece  or  any  other  obstacle,  and  at  a  higher  level,  must  be 
able  to  locate  and  acquire  tools  or  subassemblies.  In  existing 
industrial  systems,  such  navigation  is  typically  performed  by 
following  a  previously  determined  path  either  in  absolute,  or 
more  flexibly,  in  relative  coordinates.  This  is  more  or  less  the 
computer  equivalent  of  a  cam.  Such  methods  have  the  advantage 
of  being  efficient  with  regard  to  the  use  of  computational 
resources,  and  are  relatively  straightforward  to  program  and 
analyze.  On  the  negative  side,  preprogrammed  solutions  are 
inflexible  with  respect  to  variations  in  the  task,  and  intolerant 
of  changes  in  the  workspace.  A  computer-driven  welder  fol¬ 
lowing  a  preset  path  does  not  do  a  good  job  if  the  workpiece  is 
half  an  inch  from  the  anticipated  position.  There  is  thus  con¬ 
siderable  incentive  for  developing  systems  capable  of  perform¬ 
ing  active  navigation  by  using  sensory  feedback  to  direct 
movement. 

Visual  navigation  has  received  particular  attention  in  this 
respect,  primarily  because  of  its  potential,  as  manifested  in 
biological  organisms,  to  provide  a  large  amount  of  information 
about  many  different  characteristics  of  the  environment. 
Animals  use  vision  to  obtain  information  about  location, 
shape,  movement,  and  identity  of  objects.  Designers  of 
automatic  systems  would  like  to  be  able  to  design  artificial  sys¬ 
tems  with  similar  capabilities;  however,  doing  so  has  generally 
proved  to  be  extremely  difficult. 

One  reason  for  this  difficulty  is  that  extracting  visual 
information  from  images  is  simply  a  lot  of  work.  The  visual 
cortexes  of  animals  that  perform  complex  visually  moderated 
behaviors  contain  millions  of  neurons,  each  of  which  performs 
computations  which  by  the  lowest  estimates  require  thousands 
of  computer  steps  per  second  to  simulate,  and  possibly  much 
more.  Much  of  this  capacity  is  probably  necessary  to  carry 
out  whatever  cortical  image  processing  occurs. 

We  feel,  however,  that  another  reason  for  the  limited  suc¬ 
cess  in  many  efforts  is  that  the  fundamental  approach  is,  in  a 
certain  sense,  overambitious.  There  is  a  natural  tendency  to 
attempt  to  reduce  multiple  visual  tasks  to  some  lowest  com¬ 
mon  denominator.  Thus,  for  example,  the  problem  of  visual 
navigation  has  often  been  considered  a  subproblem  of  the 
“shape  from  x”  problem,  and  is  mentioned  as  an  application  of 
the  more  general  studies  designed  to  extract  accurate,  quanti¬ 
tative  information  about  the  three  dimensional  structure  of  the 
world.  However,  despite  numerous  elegant  theoretical  results, 
none  of  the  shape  from  x  studies  have  yielded  systems  which 
can  actually  be  used  to  navigate  in  a  real  environment.  The 
usual  situation  is  that  a  mathematical  solution  to  the  general 
problem  can  be  obtained  only  under  specific  assumptions 
which  are  not  particularly  valid  in  the  real  world.  On  the 
other  hand,  many  practical  applications  have  been  so  con¬ 
cerned  with  performing  one  narrowly  defined  operation  as  to 
render  them  devoid  of  useful  general  principles. 


There  is,  however,  another  approach.  Rather  than 
obtaining  a  specific  solution  to  a  general  problem,  e.g.,  solving 
the  shape  from  flow  problem  assuming  piecewise  smoothness 
and  exact  information,  one  may  seek  a  general  solution  to  a 
specific  problem,  e.g.,  solving  the  visual  obstacle  avoidance 
problem  with  no  specific  restrictions  on  the  shape  of  the 
environment.  If  this  approach  is  taken,  then  qualitative  meas¬ 
urements  which  are  specific  to  the  problem  at  hand,  but  which 
can  be  obtained  practically  in  real-world  environments,  can  be 
used.  This  is  the  approach  taken  in  this  paper. 

The  above  strategy  recalls  the  program  proposed  by  Rod¬ 
ney  Brooks  [Broo86]  for  achieving  artificial  intelligence  through 
building  robots.  The  claim  is  made  that  understanding  of 
intelligent  behavior  can  be  approached  by  attempting  to  con¬ 
struct  working  mechanisms  which  duplicate  progressively  more 
sophisticated  abilities  displayed  by  living  animals.  Thus,  for 
example,  before  one  is  able  to  navigate,  one  must  be  able  to 
wander  without  hitting  things.  Before  one  is  able  to  wander, 
one  must  be  able  to  walk.  Before  one  is  able  to  walk,  one 
must  be  able  to  stand,  and  so  forth.  A  computational  theory 
of  some  ability  should  invoke  no  “lower  level”  process  for 
which  understanding  has  not  been  demonstrated  by  an  ability 
to  build  a  mechanism  which  actually  performs  the  process. 
Any  theory  which  does  not  conform  to  this  requirement  is  on 
very  shaky  ground  at  best.  The  basic  premise  here  is  that  to 
be  believable,  a  theory  of  machine  vision  must  be  demon¬ 
strated  by  constructing  a  mechanism  which  implements  it. 
Given  the  many  plausible  sounding  theories  of  vision  which 
have  turned  out  to  be  either  unimplementable  or  so  incomplete 
as  to  be  unusable  in  a  practical  sense,  there  are  good  grounds 
for  maintaining  some  such  position. 

One  of  the  most  elementary  forms  of  navigation  is  obsta¬ 
cle  avoidance  by  a  moving,  compact  sensor.  It  is  a  prere¬ 
quisite,  however,  for  many  more  complex  abilities  since  any 
system  performing  a  more  complicated  task  must  avoid  obsta¬ 
cles  in  the  process.  Obstacle  avoidance  is  thus  one  specific 
problem  for  which  a  general  solution  is  highly  desirable.  In 
this  context,  a  general  solution  refers  to  a  system  that  works 
effectively  in  a  wide  range  of  real  environments.  This  implies, 
among  other  things,  that  the  system  performance  does  not 
depend  upon  artificial  constraints  on  the  nature  of  objects  in 
the  environment  such  as  assuming  planar  or  smoothly  curved 
surfaces,  rigid  or  unmoving  objects,  mathematically  uniform 
textures,  and  so  forth.  For  reasons  to  be  discussed,  we  believe 
that  qualitative  motion  measurements  are  particularly 
appropriate  for  this  task. 

The  use  of  motion  information  in  natural  image  process¬ 
ing  systems  is  both  widespread  and  fundamental.  The  visual 
systems  of  many  simple  organisms  are  essentially  blind  to 
fixated  images.  Even  in  the  human  visual  system,  if  the  sac¬ 
cadic  movement  of  the  eyes  is  suppressed  or  canceled,  the  per¬ 
ceived  image  fades  from  view.  In  what  might  be  considered  a 
hierarchy  of  natural  vision  systems,  from  light-sensitive  spots 
responsible  for  phototropic  behavior  in  some  single  celled 
organisms  to  those  of  the  higher  vertebrates,  the  simplest  that 
display  capabilities  which  would  be  useful  in  the  design  of 
automated  systems  appear  at  the  level  of  the  flying  insects, 
which  perform  a  wide  variety  of  complex,  visually  moderated 
behaviors,  including  flight  navigation  in  three  dimensional 
environments,  pursuit  and  capture  of  prey  (e.g.,  dragonflies), 
and  location  and  identification  of  food  sources  (e.g.,  honeybees). 
Available  evidence  suggests  that  the  insect  visual  system 
responds  primarily  to  moving  images,  and  that  such  information 
is  critical  in  orientation  and  navigation  [Reic76].  These 
systems,  the  simplest  capable  of  visual  navigation  and  obstacle 
avoidance,  appear  to  rely  almost  completely  on  motion  derived 
information.  There  is  thus  a  biological  motivation  for  investi¬ 
gating  the  role  of  movement  in  visual  navigation. 

In  this  paper,  we  explore  the  use  of  a  particular  motion 
cue,  namely,  flow  field  divergence,  which  corresponds  to  the 
intuitive  notion  of  image  expansion.  This  is  a  relatively  simple 
cue which can be derived locally and in parallel from fine-grained,
time-sliced image sequences, i.e., those in which the
motion  from  one  frame  to  the  next  is  generally  on  the  order  of, 
or  less  than,  one  pixel.  Motion  information  is  accumulated 
over  time  from  local  differential  relations,  without  establishing 
any  correspondence,  and  without  use  of  regularization  pro¬ 
cedures.  It  turns  out  that  the  flow  information  thus  obtained 
is  sufficiently  accurate  to  be  useful  in  navigation. 

Our  approach,  and  the  organization  of  this  paper,  reflect 
all  three  levels  in  the  paradigm  introduced  by  Marr  [Marr82], 
namely,  computational  theory,  algorithms,  and  implementa¬ 
tion.  Sections  II  and  III  present  the  theoretical  basis  for  our 
use  of  divergence  measurements.  Section  IV  describes  the 
design  of  a  practical  obstacle  avoidance  system  at  the  algo¬ 
rithmic  level.  Section  V  describes  the  implementation  of  the 
system,  and  the  results  of  experiments  run  using  the  imple¬ 
mentation. 


II.  INTRODUCTION  TO  MOTION  ANALYSIS


A  camera  moving  within  a  three  dimensional  environment 
produces  a  time-varying  image  which  can  be  characterized  at 
any time t by a two dimensional vector-valued function f
known  as  the  motion  field.  The  motion  field  describes  the 
two  dimensional  projection  of  the  three  dimensional  motion  of 
scene  points  relative  to  the  camera.  Mathematically,  the 
motion  field  is  defined  as  follows.  For  any  point  (x,y)  in  the 
image,  there  corresponds  at  time  t  a  three  dimensional  scene 
point (x',y',z') whose projection it is. At time $t+\Delta t$, the
world point (x',y',z') projects to the image point
$(x+\Delta x,\, y+\Delta y)$. The flow field at (x,y) at time t is given by

$$\mathbf{f}(x,y,t) = \lim_{\Delta t \to 0} \left( \frac{\Delta x}{\Delta t},\ \frac{\Delta y}{\Delta t} \right)$$


Figure  1  shows  motion  fields  for  several  situations. 


The  motion  field  depends  on  the  motion  of  the  camera, 
the  three  dimensional  structure  of  the  environment,  and  the 
three  dimensional  motion  (if  any)  of  objects  in  the  environ¬ 
ment.  If  all  these  components  are  known,  then  it  is  relatively 
straightforward to calculate the motion field. From a computer
vision standpoint, the interesting question is whether the
process  can  be  inverted  to  obtain  information  about  camera 
motion  and  structure  of  the  environment.  This  is  not  so  easy, 
and  if  arbitrary  shapes  and  motions  are  permitted  in  the 
environment,  there  may  not  be  a  unique  solution.  However,  it 
can  be  mathematically  demonstrated  that,  in  most  situations, 
a  unique  solution  exists. 


The  existence  of  such  solutions  has  inspired  a  large  body 
of  work  detailing  the  mathematical  theory  of  extracting  shape 
and/or  motion  information  from  the  motion  field.  Some 
approaches  utilize  flow  information  at  separated  locations 
under the assumption of three dimensional rigidity of the
environment [Ullm79]. Others use the flow and its derivatives
in  a  local  neighborhood  under  some  assumption  about  the 
structure  of  environmental  surfaces  (e.g.,  they  are  planar). 
Several authors have recently obtained solutions to the general
shape from flow problem in closed form with a set of linearized
equations [Tsai84, Long81]. The end result of such analytic
approaches  is  a  set  of  equations  relating  the  flow  field  and/or 
its  spatial  derivatives  to  the  camera  motion  and  the  three- 
dimensional  structure  of  the  environment.  Most  of  these  stu¬ 
dies,  however,  have  started  with  the  assumption  that  detailed 
and  accurate  motion-field  information  is  available.  Unfor¬ 
tunately,  the  solutions  to  the  equations  are  frequently  inordi¬ 
nately  sensitive  to  small  errors  in  the  motion  field.  Tsai  and 
Huang  [Tsai84]  report  60%  error  for  a  1%  perturbation  in 
input  for  some  instances  using  their  method.  This  error  sensi¬ 
tivity  is  due  both  to  inherent  ambiguities  in  the  flows  pro¬ 
duced  by  certain  camera  motions,  at  least  over  restricted  fields 
of  view,  and  to  the  reliance  on  differentiation  of  the  flow  field 
which  amplifies  the  effect  of  any  error  present  in  the  data. 
Because  the  extraction  of  accurate  flow  fields  from  real  image 
sequences has proven to be extremely difficult [Ullm81], such
analytically  based  shape  from  motion  schemes  have  so  far 
found  little  practical  application.  Recent  theoretical  research 
has  addressed  the  development  of  noise  insensitive  solution 
procedures (e.g., [Yasu85]) and methods for smoothing inconsistent
flow information extracted from image sequences (e.g.,
[Anan85]).

Several  methods  have  been  proposed  for  extracting 
motion information from image sequences. These methods fall
into two rough categories: those based on time and space
derivatives of the images, pioneered by Horn and Schunck
[Horn81], and matching techniques similar to those employed
in  stereo  vision  [Barn80]. 

The  correspondence  methods  attempt  to  identify 
corresponding  features  in  consecutive  image  frames,  and  com¬ 
pute  the  motion  field  from  the  positional  change.  This 
approach has two problems. First, the correspondence problem
is  itself  quite  difficult  since  features  may  change  from  one 
image  to  the  next,  and  even  appear  and  disappear  completely. 
Second,  since  each  correspondence  gives  the  flow  at  just  one 
point,  it  may  prove  impossible  to  find  enough  strong  features 
to  densely  determine  the  flow. 

Differential  methods,  on  the  other  hand,  attempt  to 
determine  the  flow  field  from  local  computations  of  the  spatial 
and  temporal  derivatives  of  the  gray  scale  image.  The  prob¬ 
lem  with  this  approach  is  that  the  local  apparent  motion  of 
the  image,  known  as  the  optical  flow,  does  not  necessarily 
correspond  to  the  2-D  motion  field.  The  most  obvious 
demonstrations  of  this  fact  are  pathological  examples.  For 
instance,  a  spinning,  featureless  sphere  under  constant  illumi¬ 
nation  has  zero  optical  flow,  but  a  non-zero  motion  field.  Con¬ 
versely,  a  stationary  sphere  under  changing  illumination  has 
non-zero  optical  flow,  but  zero  motion  field.  Verri  and  Poggio 
[Verr87]  have  shown  that  these  are  not  just  pathological  exam¬ 
ples  due  to  the  lack  of  features,  but  that  only  under  special 
conditions of lighting and movement do the motion field and
the  optical  flow  correspond  exactly.  They  also  show,  however, 
that  for  sufficiently  high  gradient  magnitude,  the  agreement 
can be made arbitrarily close. This corresponds to the intuition
that for strongly textured images the motion field and the
optical  flow  are  approximately  equal. 

A  more  serious  problem  with  differential  techniques  is 
that  only  the  component  of  the  optical  flow  parallel  to  the 
local  image  gradient  can  be  recovered  from  local  differential 
information.  This  is  commonly  referred  to  as  the 
aperture  problem.  Intuitively,  the  aperture  problem 
corresponds to the fact that for a moving edge, only the component
of motion perpendicular to the edge can be determined.
This  fact  is  responsible  for  the  illusion  of  upward  motion  pro¬ 
duced  by  the  rotating  spirals  of  a  barber  pole  where  either 
vertical  or  horizontal  motion  could  produce  the  local  motion  of 
the  edges,  and  the  eye  chooses  the  wrong  one.  In  order  to 
determine  both  components  of  the  flow  field  vector,  informa¬ 
tion  must  be  combined  over  regions  large  enough  to  encompass 
significant  variations  in  the  gradient  direction. 
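
The aperture problem can be illustrated with a small sketch. It assumes the brightness constancy constraint I_x u + I_y v + I_t = 0 used by gradient-based methods such as [Horn81]; the function name normal_flow and the numerical values are hypothetical and chosen only for illustration. Only the flow component along the image gradient is returned, because that is all a single local measurement determines.

```python
import numpy as np

def normal_flow(ix, iy, it, eps=1e-12):
    """Return the flow component along the image gradient.

    ix, iy, it are the spatial and temporal brightness derivatives
    (scalars or arrays). Solves I_x*u + I_y*v + I_t = 0 for the part of
    (u, v) parallel to (I_x, I_y); the part perpendicular to the gradient
    is left undetermined (the aperture problem).
    """
    ix, iy, it = (np.asarray(a, dtype=float) for a in (ix, iy, it))
    grad_sq = ix**2 + iy**2
    scale = -it / np.maximum(grad_sq, eps)  # guard against a zero gradient
    return scale * ix, scale * iy

# A vertical edge (gradient along x only) moving with true velocity (0.5, 0.3).
ix, iy = 2.0, 0.0
it = -(ix * 0.5 + iy * 0.3)          # temporal derivative implied by the constraint
u_n, v_n = normal_flow(ix, iy, it)
print(u_n, v_n)  # the motion along the edge (the 0.3 component) is not recovered
```

Recovering the second component would require combining such measurements over a region in which the gradient direction varies, as the text notes.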

In  brief,  the  correspondence  methods  typically  yield  full 
information  at  scattered  points,  while  the  differential  methods 
yield  partial  information  over  a  relatively  dense  set  of  points. 
Despite  a  great  deal  of  effort  expended  in  devising  flow  invari¬ 
ants,  regularization  methods,  and  matching  techniques,  neither 
approach  has  yielded  data  sufficiently  accurate  to  allow  the 
theoretical results to be reliably applied. Adiv [Adiv85] argues
that  inherent  near  ambiguities  in  the  3-D  structure  from 
motion  problem  may  make  the  goal  of  extracting  information 
sufficiently  precise  to  allow  uniform  application  of  the  theoreti¬ 
cal  solutions  unattainable  in  practice.  Verri  and  Poggio 
[Verr87]  make  essentially  the  same  point,  arguing  that  the 
disagreement  between  the  motion  field  and  the  optical  flow 
makes  the  computation  of  sufficiently  accurate  quantitative 
values  impractical. 

Despite  the  above  problems,  a  few  studies  have  actually 
made  use  of  motion  information  extracted  from  real-world 
image  sequences.  Lawton  [Lawt83]  robustly  obtained  the 
direction  of  sensor  translation  and  the  relative  depth  of  a  set 
of  feature  points  from  image  sequences  taken  by  a  camera  res¬ 
tricted  to  translational  motion,  using  an  adaptation  of  the  gen¬ 
eralized  Hough  transform.  The  technique  worked,  however, 
only  when  the  sensor  motion  was  constrained  to  two  degrees  of 
freedom,  as  the  parameter  space  was  otherwise  of  unmanage¬ 
able  size.  More  recently,  Lucas  [Luca85]  reported  a  method  of 
obtaining  all  five  motion  parameters,  given  a  reasonable  initial 
guess,  using  a  method  based  on  gradient  descent  in  the  param¬ 
eter  space  to  match  highly  smoothed  images.  In  order  to 
achieve  this,  images  were  averaged  over  neighborhoods  of  up 
to  60  by  60  pixels,  which  allows  the  motion  parameters  to  be 
determined  within  limits,  but  yields  little  or  no  detailed  flow 
field  information  which  could  be  used  to  analyze  the  environ¬ 
ment.  Both  these  experiments  achieved  their  success  by  limit¬ 
ing  their  scope  to  a  particular  practically  motivated  aspect  of 
the  general  problem. 

III.  FLOW  FIELD  DIVERGENCE
AS  A  QUALITATIVE  MOTION  CUE 

Quantitative  measurements  of  the  flow  field  appear  to  be 
difficult  to  make,  yet  motion  seems  to  be  extremely  important 
in even the most primitive natural vision systems. It is possible,
of course, that natural vision systems are simply much
better  at  extracting  quantitative  motion  data  than  current 
artificial  systems.  Another  possibility,  however,  is  that  such 
systems  make  use  of  more  qualitative  properties  of  the  flow 
field  which  can  be  calculated  more  robustly.  It  is  this  second 
possibility  which  interests  us. 

Consider  the  problem  of  visual  obstacle  avoidance  by  a 
compact,  moving  sensor.  The  solution  of  the  general  shape 
from  motion  problem  would  certainly  allow  a  safe  course  to  be 
navigated,  but  the  converse  is  not  true,  i.e.,  possession  of 
knowledge  about  a  safe  course  does  not  necessarily  permit  the 
construction  of  an  accurate  depth  map.  Thus  solving  the 
problem  of  visual  obstacle  avoidance  does  not  presuppose  the 
solution  of  the  general  shape  from  motion  problem.  If  the  only 
requirement is for the system to avoid collisions, substantially
less interpretation of the visual information is needed, which
considerably  simplifies  the  problem. 

For  the  purposes  of  obstacle  avoidance,  it  is  sufficient  to 
detect  those  objects  having  relative  motion  towards  the  sensor 
in  time  to  alter  course  and  avoid  a  collision.  Exact  informa¬ 
tion  about  the  shape  and  velocity  of  the  obstacle  is  not  neces¬ 
sary.  A  variety  of  qualitative  cues,  which  are  computationally 
simpler  to  robustly  evaluate  than  accurate  global  flow  fields, 
can  indicate  the  presence  of  such  obstacles.  For  instance,  the 
approach  of  a  large,  visually  textured  object  produces  an 
expanding  region  in  the  image  which  would  thus  indicate  a 
danger  zone.  For  small  obstacles,  movement  relative  to  local 
background  motion  is  a  similar  cue.  In  neither  case  is  accurate 
quantitative  information  necessary  in  order  to  avert  collision. 
Such  cues  have  the  further  advantage  that,  being  fairly  local, 
they  do  not  depend  for  their  existence  on  global  assumptions 
such  as  the  three  dimensional  rigidity  of  the  environment.  An 
expansion  detector  would  warn  of  an  approaching  object 
whether  it  was  moving  independently  or  not. 

A  practical  system  can  be  envisioned  which  would  evalu¬ 
ate  a  number  of  such  movement  cues,  and  combine  them  to 
dynamically  maintain  a  retinal  “hazard  map”  which  could  be 
used to navigate between obstacles. In such a system, it is not
necessary  that  the  cues  be  absolutely  specific.  A  certain 
number  of  false  positives  (for  instance,  a  net  divergence  detec¬ 
tor  could  be  triggered  by  rotating  objects  under  certain  cir¬ 
cumstances)  can  be  tolerated,  so  long  as  they  are  not  frequent 
enough  to  overly  clutter  the  hazard  map.  False  negatives  can 
also  be  tolerated  as  long  as  another  subsystem  detects  the 
danger.  In  general,  some  errors  are  probably  inevitable  in  sys¬ 
tems  based  on  qualitative  measurements;  they  may  occasion¬ 
ally  fail.  The  evidence  that  the  best  biological  systems  can  be 
fooled  by  appropriate  “optical  illusions”  lends  support  to  this 
view.  Because  of  the  complexity  of  the  real  world,  it  may  also 
be  unfeasible  to  mathematically  characterize  the  conditions 
under  which  such  a  system  is  guaranteed  to  work.  However, 
we  believe  that  it  is  feasible  to  design  systems  which  will  satis¬ 
factorily  perform  practical  tasks  without  recourse  to  a  quanti¬ 
tative  reconstruction  of  the  world. 

A  qualitative  measurement  which  is  particularly  useful  in 
the  context  of  obstacle  avoidance  during  visual  navigation  is 
the  divergence  of  the  motion  field.  This  is  one  of  the  potential 
candidates  mentioned  by  Thompson  in  his  position  paper  on 
“inexact” vision [Thom86]. The obvious motivation stems from
the fact that an obstacle in relative motion towards the camera
produces an expanding image, i.e., one whose image flow
has positive divergence. Thus regions of positive divergence
represent potential obstacles, with the proximity of the obstacle
positively correlated with the magnitude of the divergence. It
can also be shown that divergence is invariant under rotational
motion of the sensor, which is a valuable property because it
allows the cue to be utilized even when the sensor is not completely
stabilized.

Approaching  obstacles  do  not,  however,  represent  the 
only  situation  producing  positive  divergence  in  the  motion 
field.  A  tilted  surface  translating  parallel  to  the  image  plane 
produces  divergent  flow,  and  certain  motion  discontinuities  are 
also  interpretable  as  divergence.  On  the  other  hand,  diver¬ 
gence  always  seems  to  be  associated,  in  some  way  or  another, 
with  the  proximity  of  an  object.  The  remainder  of  this  section 
is  devoted  to  an  analysis  of  just  what  this  relationship  is,  and 
how it can be most effectively utilized by a navigator.


We  consider  the  image  formed  by  spherical  projection  of 
the environment onto a sphere of radius $\rho$ termed the
image  sphere.  The  use  of  spherical  projection  makes  all 
points  in  the  image  geometrically  equivalent,  which  consider¬ 
ably  simplifies  some  of  the  analyses.  Ordinary  cameras  do  not 
utilize  spherical  projection,  but  if  the  field  of  view  is  not  too 
wide,  the  approximation  is  reasonably  close.  Since  the  distor¬ 
tion  is  purely  geometric  in  origin,  it  could  be  corrected  should 
it  prove  to  be  a  problem  for  any  particular  camera.  In  an 
experiment  we  performed  using  a  camera  with  a  field  of  view 
of  approximately  20x30  degrees,  no  correction  was  necessary 
in  order  to  obtain  usable  results. 

For  the  purposes  of  collision  avoidance,  we  are  primarily 
interested  in  the  components  of  relative  motion  parallel  to  and 
perpendicular  to  the  sensor.  For  a  given  point  p  on  the  image 
sphere,  these  can  be  conveniently  expressed  in  terms  of  local 
coordinate systems centered on p and on the center of projection
c. The locations of points in the environment are
expressed in terms of coordinates (X,Y,Z) where (0,0,0) coincides
with the center of projection, and the positive Z axis
passes through p. Image positions in the neighborhood of p
are expressed in terms of coordinates (x,y) where (0,0)
coincides with p and the x and y axes coincide with the local
projections of the X and Y axes respectively. This is permissible
because the image sphere is locally Euclidean. The Euclidean
neighborhood of p will be referred to as the
local projective plane. Since all points in the image are
geometrically  equivalent  under  spherical  projection,  and  the 
divergence  is  a  local  operation,  all  our  analysis  can  be  done  in 
terms  of  these  local  coordinate  systems. 

The  relevant  characteristics  of  the  object  projecting  to  p 
are  its  distance  from  the  center  of  projection  and  the  orienta¬ 
tion  of  the  surface.  Since  the  divergence  involves  only  first 
order  derivatives,  higher  order  parameters  are  not  significant. 
The  distance  to  the  object  is  just  its  Z  coordinate.  The  orien¬ 
tation  of  the  surface  can  be  described  either  by  the  equation 
for the tangent plane $Z = aX + bY + c$ or by the angles $\alpha$
and $\beta$ in gradient space, where the tilt, $\alpha$, is the angle of the
projection of the surface normal on the image sphere, and the
slant, $\beta$, is the angle between the surface normal and the Z
axis. The following relationships hold between these parameters:

$$\frac{\partial Z}{\partial X} = a = \tan\beta\cos\alpha; \qquad \frac{\partial Z}{\partial Y} = b = \tan\beta\sin\alpha;$$

$$\tan\alpha = \frac{b}{a}; \qquad \tan\beta = \sqrt{a^2+b^2}. \qquad (1)$$

The  scenario  described  above  is  shown  in  Figure  2. 
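
As a small worked illustration of the relationships between the tangent-plane coefficients and the tilt and slant angles, the sketch below converts in both directions. The function names are hypothetical, and the use of atan2 to pick the quadrant of the tilt is an assumption made here (the relation tan(tilt) = b/a alone leaves the quadrant ambiguous).

```python
import math

def gradient_to_tilt_slant(a, b):
    """Convert plane coefficients (a, b) of Z = a*X + b*Y + c to (tilt, slant)."""
    tilt = math.atan2(b, a)                 # tan(tilt) = b / a, quadrant from signs
    slant = math.atan(math.hypot(a, b))     # tan(slant) = sqrt(a^2 + b^2)
    return tilt, slant

def tilt_slant_to_gradient(tilt, slant):
    """Inverse conversion: a = tan(slant)cos(tilt), b = tan(slant)sin(tilt)."""
    t = math.tan(slant)
    return t * math.cos(tilt), t * math.sin(tilt)

tilt, slant = gradient_to_tilt_slant(0.3, -0.4)
print(tilt, slant)
print(tilt_slant_to_gradient(tilt, slant))  # recovers the original (a, b) up to rounding
```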

We  now  define  a  parameterized,  one-dimensional  diver¬ 
gence  measurement  and  derive  the  relationships  between  this 
measurement  and  motion  of  the  sensor  relative  to  objects  in 
the  environment.  We  concentrate  on  this  parameterized  ver¬ 
sion  of  the  divergence  both  because  it  contains  more  informa¬ 
tion  than  the  usual  scalar,  and  because  it  is  directly  calculable 
from  real  image  sequences. 

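Let $\mathbf{f}$ be the motion field under spherical projection. We
define the directional divergence $D_\phi \mathbf{f}$ to be the one-dimensional
divergence of $\mathbf{f}$ in the direction $\phi$. Symbolically
we write

$$D_\phi \mathbf{f} = \frac{\partial f_\phi}{\partial r_\phi} \qquad (2)$$

where $f_\phi$ is the component of $\mathbf{f}$ in the $\phi$ direction and $r_\phi$ is
Euclidean distance in direction $\phi$. In terms of the local x-y
coordinate system, it is easy to show that

$$D_\phi \mathbf{f} = \cos^2\phi\,\frac{\partial f_x}{\partial x} + \cos\phi\sin\phi\left(\frac{\partial f_x}{\partial y} + \frac{\partial f_y}{\partial x}\right) + \sin^2\phi\,\frac{\partial f_y}{\partial y} \qquad (3)$$

where $\phi$ is measured from the x axis.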
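
The directional divergence can be evaluated from the first-order derivatives of the flow as a quadratic form. Writing J for the 2x2 matrix of partial derivatives [[df_x/dx, df_x/dy], [df_y/dx, df_y/dy]] and u = (cos phi, sin phi), the derivative of the component of f along u, taken in the direction u, is u^T J u. The sketch below is a minimal illustration under that formulation; the function name and the example matrices are hypothetical.

```python
import numpy as np

def directional_divergence(jac, phi):
    """One-dimensional divergence of a flow field in direction phi.

    jac is the 2x2 Jacobian [[dfx/dx, dfx/dy], [dfy/dx, dfy/dy]] of the flow
    at the point of interest; phi is measured from the x axis.
    """
    u = np.array([np.cos(phi), np.sin(phi)])
    return float(u @ np.asarray(jac, dtype=float) @ u)

rotation = [[0.0, 1.5], [-1.5, 0.0]]    # circular flow about the point
expansion = [[0.4, 0.0], [0.0, 0.4]]    # uniform radial expansion

print(directional_divergence(rotation, 0.7))    # vanishes for every phi
print(directional_divergence(expansion, 0.7))   # the same value for every phi
```

Summing the measurement over any two perpendicular directions gives the trace of J, which is the ordinary scalar divergence; this is one sense in which the parameterized version carries more information than the scalar.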

Clearly, $D_\phi$ is a linear operator. Thus we can separate
the contributions of different motions and consider them
independently. First, consider motion of the sensor in a rigid,
three-dimensional environment. The motion field at any point
p on the image sphere can be represented as the superposition
of three separate motion fields: one due to sensor rotation $\omega$,
one due to translation $v_{perp}$ perpendicular to the image sphere
at point p, and one due to translation $v_{par}$ parallel to the
image sphere at p. We will refer to these motion fields as $\mathbf{f}_{rot}$,
$\mathbf{f}_{perp}$, and $\mathbf{f}_{par}$ respectively. Hence

$$\mathbf{f} = \mathbf{f}_{rot} + \mathbf{f}_{perp} + \mathbf{f}_{par}, \qquad (4)$$

and since $D_\phi$ is linear

$$D_\phi \mathbf{f} = D_\phi \mathbf{f}_{rot} + D_\phi \mathbf{f}_{perp} + D_\phi \mathbf{f}_{par}. \qquad (5)$$

The  three  preliminary  theorems  that  follow  establish  the  rela¬ 
tionship  between  the  various  types  of  motion  and  the  direc¬ 
tional  divergence. 

Theorem 1: If the motion field is entirely due to rotation of
the sensor, then the directional divergence is identically zero.
That is,

$$D_\phi \mathbf{f}_{rot} = 0. \qquad (6)$$

Proof: The proof rests on the fact that the motion field due
to rotation is due entirely to the transformational properties of
a 2-dimensional sphere, and is independent of the 3-dimensional
structure of the environment. Let p be the point
at which $D_\phi$ is being evaluated. Recall that with every point
p there is associated a coordinate system (X,Y,Z) centered on
the center of projection with the Z axis passing through p.
By a coordinate transformation, any rotation of the sensor can
be expressed in terms of three orthogonal rotational components,
$\omega_X$, $\omega_Y$, and $\omega_Z$. In the local projective plane centered
at p, the motion field due to rotation about the X and
Y axes is given by $\mathbf{f}_{\omega_X\omega_Y} = \omega_Y\hat{\mathbf{x}} - \omega_X\hat{\mathbf{y}}$. This field is locally
constant and all the partial derivatives are zero. Hence there
is no contribution to $D_\phi(\mathbf{f}_{rot})$ from $\omega_X$ or $\omega_Y$. The motion
field due to the rotation about the Z axis is circular about p
and is given by $\mathbf{f}_{\omega_Z} = -\omega_Z r\hat{\boldsymbol{\theta}}$ in the polar coordinate system
centered at p. For this field $\partial f_x/\partial x = \partial f_y/\partial y = 0$ and
$\partial f_x/\partial y = -\partial f_y/\partial x = \omega_Z$, and hence $D_\phi(\mathbf{f}_{\omega_Z}) = 0$ by equation
(3). Since rotations about all three principal axes produce
motion fields with zero divergence, the result follows. The
principal importance of this result is to show that the
directional divergence is invariant under rotational motion.
This means that its use as a cue is not limited to situations
where the sensor is undergoing purely translational motion.

The  next  result  describes  the  effect  of  translation  perpen¬ 
dicular  to  the  image  sphere. 

Theorem 2: Let p be a point on the image sphere, and suppose
that the sensor is translating along the line from the
center of projection c to p with velocity $v_{perp}$ towards a surface
S distance Z from c. Then

$$D_\phi \mathbf{f}(p) = D_\phi \mathbf{f}_{perp} = \frac{v_{perp}}{Z}, \qquad (7)$$

independent of $\phi$ and the orientation of the surface.

Proof: In this situation the motion field is radial about p.
Let P be the point on S which projects to p. Consider some
other point $P_1$ on S, at a distance R from the Z axis, and let
$p_1$ be its projection on the image sphere. Let r be the radial
distance from p to $p_1$ in the local projective plane (Figure 3).
Then

$$r = \frac{\rho R}{Z} + (\text{second order terms in } R \text{ due to slope of } S) \qquad (8)$$

where $\rho$ is the radius of the image sphere. Taking the time
derivative gives the radial component of the motion field:

$$f_r = \frac{dr}{dt} = -\frac{\rho R}{Z^2}\frac{dZ}{dt} = \frac{\rho R\, v_{perp}}{Z^2}. \qquad (9)$$

Substituting for R from equation (8) gives

$$f_r = \frac{r\, v_{perp}}{Z}. \qquad (10)$$

Taking the partial of $f_r$ with respect to r gives the directional
divergence:

$$D_\phi \mathbf{f}_{perp} = \frac{\partial f_r}{\partial r} = \frac{v_{perp}}{Z} - \frac{r\, v_{perp}}{Z^2}\frac{\partial Z}{\partial r}. \qquad (11)$$

Since we are interested in the divergence at r=0 the second
term is zero, and the desired result follows.


The  effect  described  above  corresponds  to  the  intuitive 
notion  of  apparent  expansion  of  an  approaching  object,  which 
suggested  the  use  of  divergence  measurements  in  the  first 
place.  However,  the  parallel  component  of  the  translation  can 
also  produce  an  effect,  and  this  must  be  considered.  Transla¬ 
tion  parallel  to  an  obstacle  produces  no  divergence  if  the  sur¬ 
face  is  parallel  to  the  direction  of  translation,  in  which  case 
locally  constant  flow  is  produced.  However,  if  the  surface  is 
tilted,  the  changing  depth  can  produce  divergence  in  the 
motion  field.  This  effect  is  described  below. 


Theorem 3: Suppose that the sensor is translating in a direction
parallel to the image sphere at point p with velocity $v_{par}$.
Let P be the point on surface S which projects to p, and let
Z be the distance from the center of projection to P. The
orientation of S is described by the angles $\alpha$ and $\beta$ where $\alpha$ is
the angle between the direction of translation and the projection
of the surface normal at P, and $\beta$ is the angle between
the surface normal at P and the Z axis. Let $\phi$ be measured
from the direction of translation. Then

$$D_\phi \mathbf{f}(p) = D_\phi \mathbf{f}_{par} = -\frac{v_{par}\tan\beta}{Z}\left(\cos^2\phi\cos\alpha + \cos\phi\sin\phi\sin\alpha\right). \qquad (12)$$


Proof: Without loss of generality, we can take the direction of
translation to be the x direction in the local projective plane.
The y component of the motion field $f_y$ is then zero in the
neighborhood of p and we need consider only the derivatives
of $f_x$. Let $P_1$ be a point on S near P, and let $p_1$ be its projection
on the image sphere, with X and x representing the coordinates
of the points in their respective coordinate systems.
Then we have

$$x = \frac{\rho X}{Z} \qquad (13)$$

where $\rho$ is the radius of the image sphere. Taking the time
derivative we obtain

$$f_x = \frac{dx}{dt} = \frac{\rho}{Z}\frac{dX}{dt} = \frac{\rho\, v_{par}}{Z}. \qquad (14)$$

Taking the partial with respect to x gives

$$\frac{\partial f_x}{\partial x} = -\frac{\rho\, v_{par}}{Z^2}\frac{\partial Z}{\partial x} = -\frac{v_{par}}{Z}\frac{\partial Z}{\partial X}. \qquad (15)$$

From equation (1) $\partial Z/\partial X = \tan\beta\cos\alpha$. Hence we obtain

$$\frac{\partial f_x}{\partial x} = -\frac{\tan\beta\, v_{par}\cos\alpha}{Z}. \qquad (16)$$

In a similar manner, we obtain an expression for $\partial f_x/\partial y$:

$$\frac{\partial f_x}{\partial y} = -\frac{\tan\beta\, v_{par}\sin\alpha}{Z}. \qquad (17)$$

Substituting (16) and (17) into (3) gives the desired result.
The main importance of this result is to show that when $\cos\phi$
is zero, i.e., when the directional divergence is considered perpendicular
to the direction of motion, the effect of parallel
translation is zero.
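
The three preliminary theorems can be checked numerically with a small simulation. The sketch below is not taken from the paper: it assumes a unit image sphere, a planar surface Z = aX + bY + c, and local image coordinates (x, y) taken as the first two components of the projected point, and it estimates the directional divergence at p by central differences. All names and numerical values are hypothetical. The sign convention is also an assumption: the parameter vel is the velocity of the scene relative to the sensor, so an approach at speed v corresponds to vel = (0, 0, -v), and for parallel motion v_par is taken as the relative scene velocity along x.

```python
import numpy as np

RHO = 1.0  # radius of the image sphere (assumed unit)

def scene_point(x, y, plane):
    """Point on the plane Z = a*X + b*Y + c imaged at local coords (x, y)."""
    a, b, c = plane
    d = np.array([x, y, np.sqrt(RHO**2 - x * x - y * y)]) / RHO  # unit viewing ray
    s = c / (d[2] - a * d[0] - b * d[1])  # range along the ray
    return s * d

def flow(x, y, plane, vel=(0.0, 0.0, 0.0), omega=(0.0, 0.0, 0.0)):
    """Local (f_x, f_y) of the motion field at image coords (x, y).

    vel is the translational velocity of the scene relative to the sensor;
    omega is the sensor rotation, which moves scene points by -omega x P.
    """
    P = scene_point(x, y, plane)
    V = np.asarray(vel, dtype=float) - np.cross(omega, P)
    rng = np.linalg.norm(P)
    d = P / rng
    q_dot = RHO * (V - (V @ d) * d) / rng  # velocity of the projection RHO * P/|P|
    return q_dot[:2]

def directional_divergence(phi, plane, h=1e-4, **motion):
    """Central-difference estimate of the directional divergence at x = y = 0."""
    u = np.array([np.cos(phi), np.sin(phi)])
    f_plus = flow(h * u[0], h * u[1], plane, **motion)
    f_minus = flow(-h * u[0], -h * u[1], plane, **motion)
    return float((f_plus - f_minus) @ u / (2 * h))

plane = (0.3, -0.2, 5.0)   # a, b, c: a tilted surface at depth Z = 5 along the axis
a, b, Z = plane
phi, v = 0.7, 2.0

d_rot = directional_divergence(phi, plane, omega=(0.1, -0.3, 0.2))
d_perp = directional_divergence(phi, plane, vel=(0.0, 0.0, -v))
d_par = directional_divergence(phi, plane, vel=(v, 0.0, 0.0))

print(d_rot)                    # rotation only: approximately zero
print(d_perp, v / Z)            # approach: approximately v / Z
print(d_par, -(v / Z) * (a * np.cos(phi)**2 + b * np.cos(phi) * np.sin(phi)))
```

Because a = tan(slant)cos(tilt) and b = tan(slant)sin(tilt), the last comparison is the parallel-translation formula written in terms of the plane coefficients. Choosing phi = pi/2 makes the parallel contribution vanish, which is the property used later for obstacle detection.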

Theorems  1  through  3  describe  the  relationship  between 
the  directional  divergence  and  the  motion  of  the  sensor  in  a 
rigid,  three-dimensional  environment.  In  a  less  highly  con¬ 
strained  situation  certain  objects  in  the  environment  may  also 
be  undergoing  independent  rotational  and  translational 
motion.  We  can  use  the  results  already  derived  to  calculate 
the  divergence  in  this  case.  Let  P  be  the  point  which  projects 
to  p  on  the  image  sphere,  and  suppose  that  the  object  contain¬ 
ing  P  is  undergoing  independent  motion  relative  to  a  global 
environment  in  which  both  it  and  the  sensor  are  moving.  This 
movement  can  be  described  by  terms  specifying  the  translation 
of,  and  the  rotation  about,  P.  The  translation  can  be  broken 
into  components  perpendicular  and  parallel  to  the  image 
sphere at p, which can be subtracted from the corresponding
components due to sensor translation to yield the relative
translation. The rotation about P can be transformed into a
rotation about the center of projection, and a translation
parallel to the image sphere at p. These can also be subtracted
from the corresponding components due to sensor
motion. Hence the effects of independent movement of the
sensor  and  an  object  can  be  described  by  a  single  set  of  com¬ 
ponents  which  specify  the  relative  motion  between  the  two 
objects  in  terms  of  rotation  about  the  center  of  projection,  and 
translation  perpendicular  and  parallel  to  the  image  sphere  at 
p.  Note  that  the  transformation  of  the  rotational  motion  of 
the object does not affect the relative perpendicular translation.
Thus the relative $v_{perp}$ corresponds to the actual rate at
which  the  distance  between  the  sensor  and  point  P  is  chang¬ 
ing. 

Theorem  3  shows  that  the  component  of  translation 
parallel  to  an  obstacle  can  produce  divergence  in  the  motion 
field. Hence the presence of divergence in the flow is not
simply  equivalent  to  the  presence  of  an  obstacle  on  a  collision 
course  with  the  sensor.  However,  other  relationships  do  hold, 
which  make  divergence  useful  as  a  navigational  cue.  These 
relationships are based on the fact that the directional divergence
due to $\mathbf{f}_{par}$ is zero when $\cos\phi = 0$, that is, when $\phi$ is
perpendicular to the direction of parallel translation. This follows
immediately from Theorem 3. The following two
theorems provide the basis for the use of divergence as a cue in
obstacle  avoidance. 

Theorem  4:  If  the  distance  between  the  sensor  and  an  object 
projecting  to  point  p  on  the  image  sphere  is  decreasing,  that 
is,  the  perpendicular  component  of  the  relative  translation  is 
positive at p, then there exists a direction $\phi$ for which the
directional divergence $D_\phi \mathbf{f}(p)$ is positive.

Proof: By Theorem 1, the contribution to $D_\phi$ from any rotational
component is zero. By Theorem 2, the contribution
from the perpendicular component of motion is positive and
independent of $\phi$. If we pick $\phi$ to be the direction on the local
projective  plane  perpendicular  to  the  parallel  component  of 
motion,  then  the  contribution  from  the  parallel  component  is 
also  zero,  and  the  conclusion  follows. 

Essentially, this result states that measurements of $D_\phi$
can be used to detect any object which has a component of
motion towards the sensor (and is thus a potential collision
hazard). This holds for any combination of rotations and
translations of the sensor and the object. Thus any collision
hazard can be detected by testing whether there exists some $\phi$
for which $D_\phi \mathbf{f}$ is positive.
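
One way to carry out the test for the existence of a direction with positive directional divergence, given an estimate of the flow Jacobian at a point, is sketched below. This is an illustrative formulation rather than the procedure used in the paper: since the directional divergence equals u^T J u for the unit vector u = (cos phi, sin phi), its maximum over all directions is the largest eigenvalue of the symmetric part of J. The function names, the threshold parameter, and the example values are hypothetical.

```python
import numpy as np

def max_directional_divergence(jac):
    """Largest directional divergence over all directions, and its direction.

    jac is the 2x2 flow Jacobian [[dfx/dx, dfx/dy], [dfy/dx, dfy/dy]].
    """
    j = np.asarray(jac, dtype=float)
    sym = 0.5 * (j + j.T)                    # only the symmetric part contributes
    eigvals, eigvecs = np.linalg.eigh(sym)   # eigenvalues in ascending order
    u = eigvecs[:, -1]
    return float(eigvals[-1]), float(np.arctan2(u[1], u[0]))

def is_potential_hazard(jac, threshold=0.0):
    """True if some direction has directional divergence above the threshold."""
    return max_directional_divergence(jac)[0] > threshold

# Approach contributing 0.2 in every direction, plus a parallel-translation
# term that makes the ordinary scalar divergence (the trace) negative.
jac = np.array([[0.2 - 1.0, -0.5],
                [0.0,        0.2]])
print(np.trace(jac))                    # scalar divergence is negative here
print(max_directional_divergence(jac))  # yet some direction shows expansion
print(is_potential_hazard(jac))
```

In this example the scalar divergence alone would miss the approaching surface, while the directional measurement perpendicular to the parallel motion still registers the approach.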

An  object  on  a  collision  course  is  always  indicated  by  a 
positive  divergence  measurement.  On  the  other  hand,  not  all 
positive  divergence  indicates  an  imminent  collision.  We  have 
seen that translation parallel to an inclined surface can produce
divergent flow. In addition, even if an object has a perpendicular
component to its motion, it may also have a horizontal
component  that  will  prevent  a  collision.  The  presence  of 
divergence  does,  however,  indicate  the  proximity  of  an  object. 
The fact that all the terms which contribute to the divergence
contain the term 1/Z suggests that a strong divergence signal
indicates a nearby obstacle. The situation is somewhat complicated
by the presence of the term tan β in the expression for
the divergence due to parallel motion. The term represents
the  slope  of  the  surface  relative  to  the  image  sphere,  which  can 
be  arbitrarily  large.  The  limiting  case  occurs  at  a  depth 
discontinuity,  where  the  slope  essentially  becomes  infinite.  At 
such  points  the  divergence  signal  also  becomes  infinite,  which 
would  seem  to  limit  its  value  as  a  proximity  indicator.  The 
difficulty  is  circumvented  by  making  use  of  the  fact  that, 
although  the  value  of  the  divergence  may  be  arbitrarily  high, 
the  total  change  in  the  flow  is  bounded.  Thus  the  divergence 
due  to  a  relatively  distant  object  can  be  large,  but  only  over  a 
very  short  distance  in  the  image.  Thus  if  the  divergence  is 
considered  over  a  local  region,  it  can  be  used  as  a  proximity 
indicator. The following theorem makes the relationship explicit.

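Theorem 5: If |D_φ| > D > 0 within a disk of radius r about p,
where r is small with respect to the radius ρ of the image
sphere, then

$$ Z_{\min} < \max\left\{ \frac{2\,|v_{\mathrm{perp}}|}{D},\ \frac{|v_{\mathrm{par}}|}{rD} \right\} \qquad (18) $$

where Z_min is the smallest distance Z to any object projecting into
the disk.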

Proof: Since there are only two sources of divergence, either
v_perp or v_par must account for at least half the value of D. If
the major contribution is from v_perp, then from equation (7)

$$ \frac{D}{2} < \frac{|v_{\mathrm{perp}}|}{Z_{\min}} \qquad (19) $$

Hence

$$ Z_{\min} < \frac{2\,|v_{\mathrm{perp}}|}{D} \qquad (20) $$

If, on the other hand, the major contribution is from v_par, then
there must be a change Δf_φ in the flow of at least 2r(D/2) = rD due
to the change in depth across the disk from Z_min to Z_max.
From equation (14) we have

$$ rD < \Delta f_{\phi} = \frac{|v_{\mathrm{par}}|}{Z_{\min}} - \frac{|v_{\mathrm{par}}|}{Z_{\max}} \qquad (21) $$

Now Z_min is maximized, that is, the object is at the greatest
possible distance, when Z_max is equal to infinity; hence

$$ Z_{\min} < \frac{|v_{\mathrm{par}}|}{rD} \qquad (22) $$

The result follows from (20) and (22).


Theorem  5  shows  that  the  presence  of  divergence  over 
any  significant  area  indicates  an  object  which  is  nearby  in 
terms  of  the  translational  velocity  of  the  sensor.  The  result 
(18)  is  valid  for  arbitrary  translations  and  rotations  of  the  sen¬ 
sor  and  the  object,  if  the  components  of  relative  motion  are 
used. However, it should be noted that rotation of the object
results in a relative value of v proportional to Z, which
means that a rotating object can produce divergence even at a
large  distance.  This  would  not  seem  to  be  an  important  con¬ 
sideration  in  most  practical  applications,  since  most  obstacles 
do  not  rotate  independently. 


The  above  results  show  that  the  motion  field  divergence 
is  a  useful  but  rather  equivocal  cue  if  rotational  motion  is 
present  in  the  system.  If  the  rotational  component  of  the 
motion  is  zero  or  known,  so  that  the  effects  can  be  factored 
out,  somewhat  stronger  results  can  be  obtained.  In  particular, 
if the flow is due purely to translation, then the direction of
v_par is the same as the direction of the flow. Since this is
easily determined, φ can be chosen perpendicular to this direction,
and D_φ used to determine, unequivocally, whether or not
v_perp is positive. Furthermore, if only translational motion is
present, then the object projecting to p is on a collision course
if and only if D_φ is positive at p, and the flow is zero. In this
situation, p corresponds to a focus of expansion. In practice,
these  results  might  find  considerable  application,  since  it  is 
relatively  easy  to  obtain  the  rotational  components  of  the 
motion from simple inertial sensors. It is also possible to
robustly obtain the motion parameters of the sensor by purely
visual means using a 360 degree field of view [Nels87]. In a
system  stabilized  by  either  means,  the  stronger,  translational 
results  could  be  used. 


The theoretical results can be summarized as follows.
For a sensor with arbitrary rotational and translational
motion, any object on a collision course with the sensor will
produce a positive directional divergence in the motion field
for some angle φ; however, not all positive divergence indicates
an imminent collision. Conversely, any non-zero directional
divergence measurement indicates the presence of an object
which is nearby in terms of the relative translational velocities
of the sensor and the object; however, not all nearby objects
produce divergence in the motion field. As far as theoretical
use of information contained in the motion field is concerned,
these results are sufficient for low-level obstacle avoidance,
although used alone, they could result in maneuvers which are
not strictly necessary to avoid collision. The results can be
strengthened to eliminate such situations if only translational
motion is permitted.

The  above  results  are  all  fairly  simple,  and  the  informa¬ 
tion  available  is  somewhat  limited.  The  primary  value  of  the 
analysis  lies  in  the  fact  that  the  described  divergence  measures 
can  actually  be  obtained,  easily  and  reliably,  from  real  image 
sequences.  Motion  field  divergence  is  also  a  robust  cue  in  the 
sense  that  it  is  stable  in  the  presence  of  significant  amounts  of 
error  in  the  motion  field.  Since  determining  the  motion  field 
precisely  is  a  difficult  task,  this  is  a  highly  desirable  charac¬ 
teristic.  Divergence  measurements  can  be  obtained  which  are 
accurate  and  reliable  enough  to  be  usable  for  navigation  in  the 
manner  described.  The  remainder  of  the  paper  is  devoted  to 
demonstrating  these  claims. 

IV.  DESIGN  OF  AN  OBSTACLE  AVOIDANCE  SYSTEM 

The  results  obtained  above  suggest  that  detection  of 
directional  divergence  in  the  motion  field  could  form  the  basis 
for  a  simple  visual  obstacle  avoidance  system.  In  order  to 
investigate  this  possibility  we  designed  a  system  to  perform 
navigation  using  motion  field  divergence  as  the  sole  sensory 
cue.  This  system  utilizes  a  sequence  of  pictures  taken  by  a 
camera  mounted  on  a  moving  robot  arm.  Periodically,  flow 
field  divergence  is  computed  from  the  image  sequence  and 
analyzed  for  evidence  of  obstacles.  This  information  can  then 
be  used  to  guide  the  camera  away  from  potential  collision 
hazards.  This  section  describes  the  design  of  the  image  pro¬ 
cessing  system  responsible  for  extracting  usable  information 
from  a  sequence  of  images. 

Motion  information  was  extracted  using  a  differential 
technique.  This  approach  was  chosen  because  it  produces  a 
relatively  dense  field  of  values  and  because  it  uses  only  simple 
local operations. It is thus directly implementable in a highly
parallel, connectionist type architecture, which would be a
consideration in any real system. Correspondence methods, in
addition to the problem of identifying correspondences, produce
a sparse field of values, which makes calculation of a continuous
measurement such as divergence difficult.

Use  of  a  differential  method  means  dealing  both  with  the 
non-identity  of  the  motion  field  with  the  optical  flow,  and  with 
the  aperture  problem.  The  first  problem  was  eliminated  by 
assuming  sufficient  visual  texture  in  the  image  to  avoid  any 
difficulty.  Since  the  divergence  is  a  highly  robust  measure¬ 
ment,  this  is  not  a  very  stringent  requirement.  Ordinary 
natural  objects  such  as  tree  bark,  broken  brick,  and  human 
faces  were  found  to  have  sufficient  visual  texture  to  make  the 
method workable. The aperture problem arises from the fact
that,  at  a  given  point  and  time,  only  one  component  of  the 
optical  flow  can  be  recovered,  namely  the  component  parallel 
to  the  local  image  gradient.  This  problem  was  approached  by 
defining  a  number  of  principal  axes  and  processing  motion 
parallel  to  each  axis  separately.  For  each  axis,  the  parallel 
flow  component  can  be  determined  at  some  fraction  of  the 
points  in  the  image,  namely  those  where  the  gradient  happens 
to be parallel to the axis. Information for each axis is contained
in a structure referred to as a partial map. These
structures can be thought of as arrays having the same size as
the original image but with unknown values at some points.
Thus we obtain a number of partial maps, each of which
specifies the component of the flow field parallel to a particular
axis at some fraction of the image points. The directional
divergence can then be independently calculated for each
direction from the partial flow maps and the results combined
for use by the navigation system. This process is
described in more detail later.

The  differential  calculations  are  most  easily  implemented 
when  the  image  motion  between  frames  is  incremental.  The 
system  was  thus  designed  to  utilize  sequences  where  the  max¬ 
imum  image  flow  is  less  than  one  pixel  per  frame.  In  a  system 
exposed  to  a  large  range  of  flow  magnitudes,  simultaneous 
analysis  of  different  resolution  versions  of  the  image  might  be 
necessary.  However,  for  a  sensor  moving  at  a  relatively  con¬ 
stant  speed  through  a  static  environment,  a  single  level  system 
proved  to  be  sufficient.  The  frame  rate  was  chosen  to  allow  an 
approaching  obstacle  to  be  detected  in  time  to  avoid  it. 

An  important  strategy  for  reducing  the  level  of  noise  in 
the  system  and  obtaining  divergence  measurements  robust 
enough  to  be  reliable  was  the  principle  of  accumulation  of 
information  over  time.  The  strategy  is  based  on  the  fact  that 
motion  features  tend  to  persist  in  the  same  neighborhood  for 
several  frames.  By  combining  flow  estimates  computed  for 
several  overlapping  blocks  of  time,  the  reliability  of  the  flow 
estimate  can  be  increased.  This  improvement  is  a  result  of 
averaging  partially  independent  estimates  for  the  same  value, 
which  is  a  well  known  technique.  Most  applications  of  averag¬ 
ing  in  image  processing,  however,  utilize  spatial  averaging 
which  has  the  undesirable  side  effect  of  also  reducing  resolu¬ 
tion.  Because  the  differential  techniques  of  flow  approximation 
rely  on  the  small  scale  variation  or  texture  in  the  image,  any 
blurring  of  the  image  by  spatial  averaging  severely  degrades 
the  performance.  This  does  not  occur,  however,  with  temporal 
averaging. 

The  temporal  accumulation  of  information  also  improves 
performance  in  another  way.  Note  that  if  only  a  single  motion 
computation  is  made,  the  known  points  of  the  partial  flow 
maps  will  form  disjoint  sets  since  the  gradient  at  a  point  has  a 
unique  direction.  However,  if  information  is  combined  over 
several  time  frames,  the  number  of  known  points  in  each  map 
increases  and  the  sets  can  overlap,  especially  if  the  image  con¬ 
tains  fine  grained  visual  textures  which  cause  the  gradient 
direction  at  a  point  to  change  rapidly.  Thus  not  only  is  the 
information  more  accurate,  but  there  is  more  of  it.  In  our 
experiments,  the  introduction  of  temporal  accumulation  of 
information  dramatically  increased  the  reliability  of  our  sys¬ 
tem. 

The  process  of  converting  an  image  sequence  to  a  retinal 
hazard  map  involves  the  following  steps. 

1. Creation of partial maps on the basis of gradient direction,
and computation of spatial and temporal derivatives.

2.  Computation  of  the  parallel  component  of  optical  flow  for 
each  partial  map. 

3.  Temporal  accumulation  of  information  by  combining 
separately  computed  estimates  of  the  flow  at  several 
closely  spaced  times. 

4. Approximation of the directional divergence from partial
flow maps in each principal direction.

5.  Combination  of  directional  divergence  maps  to  produce  a 
retinal  hazard  map. 

We  now  give  a  more  detailed  description  of  the  above  steps. 

The images acquired by the camera were 512x512 pixels
in size. Before processing, these were reduced using averaging
by a factor of four to 128x128. This was done primarily to
reduce the processing and storage requirements to a manageable
level, but had the side effect of making it easier to obtain
image sequences with subpixel movement between frames. A
set of four sequential gray scale images, referred to as a
4-sequence, was used as a basis for computing the spatial and
temporal derivatives used in extracting flow information.
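
The reduction step can be sketched as follows. This is only an illustration: it assumes the averaging was done over non-overlapping 4x4 blocks, which the text does not state, and the function name reduce_by_averaging is hypothetical.

```python
def reduce_by_averaging(image, factor=4):
    """Shrink a 2-D gray-scale image (list of rows) by averaging
    non-overlapping factor x factor blocks (an assumed scheme)."""
    rows, cols = len(image), len(image[0])
    reduced = []
    for r in range(0, rows - factor + 1, factor):
        out_row = []
        for c in range(0, cols - factor + 1, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            out_row.append(sum(block) / len(block))
        reduced.append(out_row)
    return reduced


# Tiny stand-in for a 512x512 frame: an 8x8 ramp reduced to 2x2.
frame = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
small = reduce_by_averaging(frame, factor=4)
```

Applied to a 512x512 frame with factor 4, the same routine yields the 128x128 images used in the rest of the processing.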

The  first  step  was  to  determine  gradient  information  by 
applying  the  four  principal  3x3  Sobel  operators  (correspond¬ 
ing  to  vertical,  horizontal,  and  two  diagonal  axes)  to  each  of 
the  images  in  the  4-sequence  (Figure  4).  The  image  points 
were  assigned  to  partial  maps  on  the  basis  of  the  Sobel  opera¬ 
tor  having  the  greatest  response.  The  gradient  information 
from  the  four  original  images  was  combined  to  produce  direc¬ 
tional  gradient  maps  in  the  following  manner.  For  every  pixel, 
if  the  gradient  direction  was  the  same  in  each  of  the  four 
images,  then  the  corresponding  pixel  in  the  partial  map  for 
that  direction  was  set  to  the  average  of  these  directional  gra¬ 
dients.  All  other  pixels  in  the  partial  maps  were  labeled  as 
unknown.  The  idea  was  to  ensure  that  the  gradient  features 
used  in  calculating  the  flow  were  persistent  enough  to  obtain  a 
reliable  value.  The  result  of  this  is  four  partial  maps 
corresponding  to  the  vertical,  horizontal,  and  two  diagonal 
axes,  each  containing  all  those  points  at  which  there  is  a  per¬ 
sistent  gradient  in  the  specified  direction.  More  directions 
could  be  used  in  an  attempt  to  increase  accuracy,  but  this 
would  require  larger  directional  operators,  which  could  blur 
out  small  textural  features,  and  would  also  reduce  the  density 
of  the  known  points  in  each  partial  map. 
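
A minimal sketch of the assignment of pixels to partial maps is given below. The four 3x3 kernels are common Sobel-style masks chosen for illustration, and the names (KERNELS, dominant_axis, partial_gradient_maps) are hypothetical; the masks actually used are those of Figure 4. Unknown values are represented by None.

```python
AXES = ("horizontal", "vertical", "main_diagonal", "anti_diagonal")

# Assumed Sobel-style kernels, one per principal axis.
KERNELS = {
    "horizontal": [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],
    "vertical": [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],
    "main_diagonal": [[-2, -1, 0], [-1, 0, 1], [0, 1, 2]],
    "anti_diagonal": [[0, -1, -2], [1, 0, -1], [2, 1, 0]],
}


def response(image, r, c, kernel):
    """Response of a 3x3 kernel centred on pixel (r, c)."""
    return sum(kernel[i][j] * image[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))


def dominant_axis(image, r, c):
    """Axis whose operator responds most strongly, with its signed response."""
    responses = {axis: response(image, r, c, KERNELS[axis]) for axis in AXES}
    axis = max(AXES, key=lambda a: abs(responses[a]))
    return axis, responses[axis]


def partial_gradient_maps(frames):
    """Keep a pixel only if the same axis dominates in every frame;
    store the average directional gradient, leave the rest unknown."""
    rows, cols = len(frames[0]), len(frames[0][0])
    maps = {axis: [[None] * cols for _ in range(rows)] for axis in AXES}
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            picks = [dominant_axis(frame, r, c) for frame in frames]
            if len({axis for axis, _ in picks}) == 1:
                axis = picks[0][0]
                maps[axis][r][c] = sum(v for _, v in picks) / len(picks)
    return maps


# A static horizontal ramp: only the horizontal map gets known values.
ramp = [[float(c) for c in range(5)] for _ in range(5)]
maps = partial_gradient_maps([ramp] * 4)
```

Border pixels stay unknown here because the 3x3 operators do not fit there; how the original system treated the border is not stated.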

The time derivative was approximated by averaging the
gray scale values over a 3x3 neighborhood in each image in
the 4-sequence, and summing the averages using the weight
vector (-1, -1, 1, 1), that is, subtracting the earlier values from
the later. A least squares fit, (-3, -1, 1, 3), is not appropriate in
this case because it weights the distant points more heavily
when there is no reason to suppose they are more important.
Using the time derivative, the parallel component of the image
flow was then computed for each direction by dividing the
time derivative by the gradient magnitude at all points where
the gradient magnitude was significantly greater than zero.
This yielded four partial maps, each specifying a particular
projection of the flow field at a set of points.
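
The computation for a single direction might look as follows. This is a sketch under stated assumptions: the division uses the signed directional gradient stored in the partial map, the cutoff min_gradient stands in for "significantly greater than zero", and all names are hypothetical.

```python
TIME_WEIGHTS = (-1, -1, 1, 1)  # later frames minus earlier frames


def local_mean(image, r, c):
    """Average gray level over the 3x3 neighborhood centred on (r, c)."""
    return sum(image[r + dr][c + dc]
               for dr in (-1, 0, 1) for dc in (-1, 0, 1)) / 9.0


def time_derivative(frames, r, c):
    """Weighted sum of the smoothed values over a 4-sequence."""
    return sum(w * local_mean(frame, r, c)
               for w, frame in zip(TIME_WEIGHTS, frames))


def flow_component(frames, gradient_map, min_gradient=1.0):
    """Raw ratio of time derivative to directional gradient wherever
    the gradient is known and not too small; unknown (None) elsewhere."""
    rows, cols = len(gradient_map), len(gradient_map[0])
    flow = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            g = gradient_map[r][c]
            if g is not None and abs(g) >= min_gradient:
                flow[r][c] = time_derivative(frames, r, c) / g
    return flow


# A unit ramp drifting right by half a pixel per frame.
frames = [[[c - 0.5 * t for c in range(5)] for _ in range(5)]
          for t in range(4)]
# Directional gradient a 3x3 Sobel-style operator gives for a unit ramp.
gradient_map = [[8.0 if 0 < r < 4 and 0 < c < 4 else None
                 for c in range(5)] for r in range(5)]
flow = flow_component(frames, gradient_map)
```

For this example the raw ratio is -0.25 at every interior pixel. It differs from the drift of 0.5 pixel per frame by a constant factor and a sign, which come from the operator gains (8 for the gradient of a unit ramp, 4 for the time weights) and from the brightness constancy relation. Since the same constant applies everywhere, it does not affect where divergence is detected.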

Flow  field  information  was  obtained  as  above  for  eight 
overlapping  4-sequences,  producing  eight  registered  partial 
maps  for  each  direction  (Figure  5).  Information  was  combined 
separately  for  each  direction  by  averaging  the  known  values  at 
every  location.  If  all  values  at  a  certain  location  were  unk¬ 
nown,  then  the  resultant  value  in  the  combined  partial  map 
remained  unknown.  This  temporal  accumulation  of  informa¬ 
tion  both  increases  the  number  of  known  points  for  each  direc¬ 
tion  and  improves  the  accuracy  of  the  estimates. 
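
The combination rule is simple enough to state in a few lines. The sketch below assumes unknown values are represented by None; the name accumulate is hypothetical.

```python
def accumulate(partial_maps):
    """Average the known values at every location across registered
    partial maps; a location stays unknown only if all inputs are."""
    rows, cols = len(partial_maps[0]), len(partial_maps[0][0])
    combined = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            known = [m[r][c] for m in partial_maps if m[r][c] is not None]
            if known:
                combined[r][c] = sum(known) / len(known)
    return combined


# Three 1x3 maps standing in for the eight maps of one direction.
maps = [
    [[1.0, None, None]],
    [[3.0, 2.0, None]],
    [[None, None, None]],
]
combined = accumulate(maps)
```

The first location averages two estimates, the second is filled in by a single estimate, and the third remains unknown, which illustrates both effects mentioned above: better estimates and more known points.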

The  directional  divergence  was  computed  for  each  direc¬ 
tion  from  the  combined  flow  map.  This  operation  made  use  of 
large  masks,  approximately  20x20  pixels,  symmetrically 
divided  into  positive  and  negative  halves  (Figure  6).  The 
number of known points falling inside the mask at each position
was counted, and if this was greater than a certain threshold
(e.g. 10), then the information was deemed sufficient for
an accurate computation of the divergence. The divergence
was then calculated by subtracting the average of the values in
the negative half of the mask from the average of the values in
the positive half. Where there was insufficient information,
the pixel was labeled as unknown. The large number of unknown
points in the motion maps allowed this process to be
efficiently implemented as a generalized Hough transform,
which somewhat compensated for the large size of the masks.
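
The mask operation for the horizontal direction can be sketched as below. This is a direct gather over the mask rather than the Hough-style accumulation mentioned above, and it adds one assumption not in the text: both halves must contain at least one known point, since otherwise an average is undefined. The names and the exact mask geometry are hypothetical.

```python
def horizontal_divergence(flow, half=10, min_known=10):
    """Difference between the mean flow in the right (positive) and
    left (negative) halves of a (2*half) x (2*half) mask."""
    rows, cols = len(flow), len(flow[0])
    div = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neg, pos = [], []
            for rr in range(max(0, r - half), min(rows, r + half)):
                for cc in range(max(0, c - half), min(cols, c + half)):
                    v = flow[rr][cc]
                    if v is not None:
                        (neg if cc < c else pos).append(v)
            # Enough evidence overall, and something in each half.
            if len(neg) + len(pos) > min_known and neg and pos:
                div[r][c] = sum(pos) / len(pos) - sum(neg) / len(neg)
    return div


# Horizontal flow growing linearly with column: a uniform expansion.
flow = [[0.1 * (c - 4) for c in range(9)] for _ in range(9)]
div = horizontal_divergence(flow, half=2, min_known=3)
```

For a linear flow the result equals the slope times the separation between the centres of the two halves, so the value at the centre of this example is 0.1 x 2 = 0.2. Because unknown points contribute nothing, a sparse map could instead be processed by letting each known point add to the sums of every mask position it falls in, which is presumably the sense in which the text speaks of a generalized Hough transform.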

The  next  step  was  to  combine  the  different  one¬ 
dimensional  divergence  maps.  This  was  done  disjunctively, 
that  is,  the  divergence  of  the  combined  map  was  set  to  the 
average  of  the  separate  divergences  if  at  least  one  of  the 
separate  maps  had  a  known  value  at  that  position.  It  might 
be  supposed  that  a  more  robust  procedure  would  be  to  com¬ 
bine  the  separate  divergence  maps  conjunctively,  that  is,  to 
require  supporting  evidence  from  at  least  two  directions.  In 
practice,  however,  this  did  not  yield  good  results,  primarily 
because  texture  edges  in  real  images  often  possess  a  local  corre¬ 
lation  which  prevents  recovery  of  information  for  more  than 
one  gradient  direction. 

The  combined  divergence  map  was  then  converted  into  a 
navigation  hazard  map  as  follows.  Regions  where  a  divergence 
value  had  been  computed  which  had  a  magnitude  less  than  the 
expected  error  of  the  computation  were  labeled  “safe”,  i.e.,  no 
detectable  divergence.  Regions  with  a  divergence  value  of 
magnitude  greater  than  the  expected  error  were  labeled 
“danger”  in  proportion  to  the  magnitude  of  the  divergence. 
(In  this  experiment  regions  of  high  negative  divergence  were 
also  labeled  as  dangerous  since  they  probably  indicate  the 
proximity  of  an  obstacle.)  Finally  regions  where  no  divergence 
could  be  computed  were  labeled  “caution”. 
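
The disjunctive combination and the labeling rule can be sketched together. The assumptions here are that unknown values are represented by None, that a magnitude exactly equal to the expected error counts as "danger", and that the danger strength is simply the magnitude of the divergence; the names are hypothetical.

```python
def combine_disjunctively(direction_maps):
    """Average the known directional divergences at each position;
    a value from a single direction is enough."""
    rows, cols = len(direction_maps[0]), len(direction_maps[0][0])
    combined = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            known = [m[r][c] for m in direction_maps if m[r][c] is not None]
            if known:
                combined[r][c] = sum(known) / len(known)
    return combined


def label_hazards(divergence, expected_error):
    """Return a grid of labels and a parallel grid of danger strengths."""
    labels, strength = [], []
    for row in divergence:
        label_row, strength_row = [], []
        for d in row:
            if d is None:
                label_row.append("caution")   # nothing could be computed
                strength_row.append(0.0)
            elif abs(d) < expected_error:
                label_row.append("safe")      # no detectable divergence
                strength_row.append(0.0)
            else:
                label_row.append("danger")    # positive or negative
                strength_row.append(abs(d))
        labels.append(label_row)
        strength.append(strength_row)
    return labels, strength


horizontal = [[0.01, None, 0.30]]
vertical = [[0.03, None, None]]
combined = combine_disjunctively([horizontal, vertical])
labels, strength = label_hazards(combined, expected_error=0.05)
```

In this example the first position is "safe", the second "caution", and the third "danger" with strength 0.30, even though only one direction supplied a value there.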

This  hazard  map  was  used  by  a  simple  navigation  pro¬ 
gram  to  guide  the  robot  bearing  the  camera.  Basically,  the 
field  of  view  was  divided  into  a  3X3  grid  and  the  robot 
directed  towards  the  square  having  the  greatest  “safe”  and  the 
smallest  “danger”  and  “caution”  areas.  More  complex 
schemes  could  easily  be  devised,  but  this  proved  sufficient  to 
demonstrate  the  viability  of  the  technique. 

The computation of the directional divergence described
above can be implemented in a highly parallel, layered connectionist
architecture using only local connections. In each layer
there is one processor per pixel which receives information
from a local neighborhood in the layer below. Information
passes through the layers in one direction, so there is no problem
with settling times. Time averaging and calculation of
time derivatives can be performed by means of delay buffers.
Figure  7  shows  a  schematic  diagram  of  one  connectionist  archi¬ 
tecture  which  could  be  used.  The  squares  represent  data 
storage  in  the  form  of  partial  images,  and  the  circles  represent 
operations  by  which  the  partial  images  are  transformed  into 
others.  Stacks  of  squares  represent  delay  buffered  data  used 
for the temporal accumulation of information. The arrows
show  the  data  dependencies  and  the  flow  of  information 
through  the  system. 

V.  IMPLEMENTATION  AND  TESTING 

The  system  described  above  was  investigated  using  an 
American  Robot  Merlin  robot  arm  having  six  rotational  joints 
and a sphere of operation approximately two meters in diameter.
A Sony DC-37 CCD miniature television camera with a
16mm  focal  length  lens  was  attached  to  the  arm.  The  arm 
was  used  to  move  the  camera  through  a  three  dimensional 
environment  containing  various  obstacles.  Some  attention  was 
necessary  to  ensure  that  the  obstacles  in  the  environment  did 
not  interfere  with  the  motion  of  parts  of  the  arm  distant  from 
the  camera,  since  solving  the  motion  planning  problem  for  the 
entire robot was not a goal of this study. In general, however,
the arrangement provides a flexible method for investigating
the problems of visual navigation. In order to avoid becoming
entangled in the so-called piano mover’s problem, experiments
were carried out in relatively uncluttered environments, i.e.,
ones in which obstacles can be avoided by changing course or
moving aside.

Since  the  differentially  based  detection  of  divergence 
depends  on  the  presence  of  small  scale  variation  in  the  image 
gray  level,  both  the  obstacle  and  the  background  were  chosen 
to  provide  strong  visual  texture.  The  sort  of  visual  texture 
needed  is  present  in  most  natural  objects,  and  we  ran  our  tests 
using  obstacles  such  as  broken  cinder  bricks  and  pieces  of  tree 
bark  with  satisfactory  results.  It  is  a  curious  property  of  our 
method  that  because  of  the  requirement  for  complex  visual 
textures,  it  would  not  be  expected  to  work  well  in  the  blocks 
world. 

As  described  above,  the  unit  of  input  to  the  system  is  a 
sequence  of  12  images,  that  is  8  overlapping  4-sequences.  In 
the  scale  chosen  for  the  demonstration,  this  corresponds  to  a 
camera  translation  of  one  to  two  inches.  The  image  processing 
was  performed  on  a  VAX  11/785  computer,  and  the  time 
needed  to  extract  the  divergence  from  the  basic  sequence  of  12 
128X128  images  (reduced  from  512x512)  was  on  the  order  of 
several  minutes.  This  is  not  quite  fast  enough  for  real-time 
operation;  however,  the  algorithms  were  deliberately  designed 
to  be  implementable  on  a  fine  grain  parallel  architecture  or 
neural  net,  and  would  be  extremely  fast  on  such  a  system. 

We  first  tested  the  response  of  the  system  to  individual 
translational  and  rotational  movements  to  see  whether  the 
output  of  the  divergence  detector  corresponded  to  theoretical 
predictions.  This  was  essentially  a  test  of  the  ability  of  the 
detector  to  respond  to  divergence  in  actual  image  sequences  in 
a  reliable  and  robust  manner.  The  environment  for  these  tests 
consisted  of  a  distant  textured  background  (actually  a  piece  of 
cloth  with  a  complicated  Paisley  print),  and  an  obstacle  in  the 
foreground  (variously  a  piece  of  tree  bark  or  a  broken  cinder 
block). 

The  first  test  consisted  of  moving  the  camera  directly  for¬ 
ward  towards  the  obstacle.  Theoretically,  the  obstacle  should 
display  a  greater  divergence  than  the  background  and  should 
be detectable on that basis. Figure P1 shows the view seen by
the camera at the beginning of motion towards the obstacle, in
this case a piece of bark. In the sequence of 12 images input to
the system, the bark expands by about 10%, corresponding to
a camera motion of about one and one half inches. The
change  between  frames  is  almost  indistinguishable  to  the  eye 
when  two  successive  images  are  viewed  side  by  side.  However, 
it  is  interesting  to  note  that  if  the  two  images  are  flashed 
alternately  on  the  same  screen,  a  distinct  difference  is  noted, 
and  the  bark  seems  to  pop  back  and  forth.  Figure  P2  shows 
the hazard map generated by motion directly towards the obstacle.
White areas correspond to regions labeled “danger”,
gray to regions labeled “caution”, and black to “safe” regions.
The  bark  stands  out  clearly  as  an  obstacle.  The  gray  border 
around  the  image  is  an  artifact  of  the  lack  of  any  motion 
information  outside  of  the  image. 

In  the  next  test,  the  camera  was  moved  parallel  to  the 
obstacle.  In  this  case,  according  to  our  theory,  the  flow 
discontinuities  at  the  boundaries  of  the  obstacle  should  stand 
out.  Figure  P3  shows  the  results  of  this  test  for  the  piece  of 
bark. As expected, the boundaries stand out clearly.

We next tested the principle of invariance under rotation.
Recall that the divergence was invariant under rotation of the
camera. Thus image sequences resulting from pan, tilt, or roll
of the camera should show no divergence. Figure P4 shows
the  hazard  map  resulting  from  a  camera  pan  of  about  two 
degrees  corresponding  to  an  image  motion  of  approximately 
one  half  pixel  per  frame.  As  expected,  little  or  no  divergence  is 
detected.  Figure  P5  shows  the  hazard  map  resulting  from  a 
camera  roll  of  about  10  degrees.  The  flow  field  for  this  motion 
is  a  circular  whirlpool  about  the  center  of  the  image. 
Mathematically,  the  divergence  for  this  pattern  is  zero,  but  the 
interaction  of  the  components  is  more  critical  than  in  the  pre¬ 
vious  case.  There  are  a  few  small  regions  of  spurious  diver¬ 
gence  reported  for  this  sequence,  but  basically,  the  map  is  clear 
as  expected. 
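
The zero-divergence claim for a roll can be checked directly. The following is a sketch under assumptions not stated in the text: a planar approximation of the image near its center, a roll rate ω, and a directional divergence taken as the derivative, along a unit direction u = (u_x, u_y), of the flow component along u. The symbols ω, u, and J are introduced here for illustration only.

```latex
\mathbf{f}(x, y) = \omega \begin{pmatrix} -y \\ x \end{pmatrix},
\qquad
J = \frac{\partial \mathbf{f}}{\partial (x, y)}
  = \omega \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix},
\qquad
D_{\mathbf{u}} = \mathbf{u}^{\top} J \, \mathbf{u}
  = \omega \left( -u_x u_y + u_y u_x \right) = 0 .
```

Because J is antisymmetric, the directional term vanishes for every choice of u, not only for the sum over two perpendicular directions.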

The  above  tests  indicated  that  our  divergence  detector 
worked  fairly  robustly  in  detecting  the  qualitative  divergence 
features  predicted  by  the  theoretical  analysis.  The  information 
should  thus  be  usable  for  navigational  purposes.  We  tested 
this  by  implementing  a  very  simple  navigation  algorithm. 
Basically, the camera moves straight ahead for the distance
necessary to obtain a sequence of 12 images. This sequence is
then  analyzed  to  produce  a  hazard  map.  This  map  is  then 
divided  into  sections  by  a  3X3  grid.  Each  section  is  assigned 
a  hazard  index  by  adding  the  number  of  “danger”  points  to 
1/10  the  number  of  “caution”  points.  The  direction  of  move¬ 
ment  is  then  changed  to  point  towards  the  center  of  the  sec¬ 
tion  with  the  lowest  hazard  index,  and  another  sequence  of  12 
images  is  acquired. 
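
The hazard index and the choice of section can be sketched as follows. The weights (1 per "danger" point, 1/10 per "caution" point) are those given above; the tie-breaking rule (first section in row-major order) and the function names are assumptions made for the sketch.

```python
def hazard_indices(labels, grid=3):
    """Hazard index per grid section: number of 'danger' points plus
    one tenth of the number of 'caution' points."""
    rows, cols = len(labels), len(labels[0])
    indices = [[0.0] * grid for _ in range(grid)]
    for r in range(rows):
        for c in range(cols):
            gr, gc = r * grid // rows, c * grid // cols
            if labels[r][c] == "danger":
                indices[gr][gc] += 1.0
            elif labels[r][c] == "caution":
                indices[gr][gc] += 0.1
    return indices


def safest_section(indices):
    """Grid coordinates (row, column) of the lowest hazard index."""
    cells = [(gr, gc) for gr in range(len(indices))
             for gc in range(len(indices[gr]))]
    return min(cells, key=lambda cell: indices[cell[0]][cell[1]])


# A 6x6 hazard map: danger everywhere except a safe upper-right
# section and a central section where nothing could be computed.
labels = [["danger"] * 6 for _ in range(6)]
for r in (0, 1):
    for c in (4, 5):
        labels[r][c] = "safe"
for r in (2, 3):
    for c in (2, 3):
        labels[r][c] = "caution"
indices = hazard_indices(labels)
target = safest_section(indices)
```

Here the upper-right section has index 0 and is selected, while the all-"caution" central section scores 0.4 and is preferred over any section containing "danger" points.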

Figures  N1-N10  show  the  results  of  a  test  run  where  the 
system  navigated  through  the  gap  between  two  pieces  of  bark. 
Figures N1-N10 (a) show the scene viewed by the camera at
the  beginning  of  each  step,  and  Figures  N1-N10  (b)  show  the 
hazard  map  at  the  end  of  each  step.  The  obstacles  were  large 
enough  to  prevent  the  robot  from  seeing  around  them,  and  the 
robot  was  prevented  from  navigating  up  and  over  the  obsta¬ 
cles  by  allowing  only  left-right  directional  changes.  This 
forced  the  system  to  find  the  gap.  Figure  8  shows  the  path 
followed  by  the  camera  as  it  approached  the  obstacles.  Note 
how the robot switched course back and forth in order to
keep  heading  for  the  gap  between  the  obstacles. 

Thus  we  observe  that  even  a  very  simple  system,  without 
any  elaborate  spatial  reasoning  abilities,  can  exhibit  significant 
visual navigational skill. There is certainly room for improvement
in the navigation algorithm. For example, the hazard
map  could  be  updated  more  frequently,  rather  than  at  the  end 
of  disjoint  12-image  sequences,  which  would  allow  curves  to  be 
negotiated more smoothly. Since the divergence is unaffected
by camera rotation, changing the direction of translation
continuously rather than at discrete points would pose no problem.
The system could also be instructed to handle situations
where  no  free  path  is  visible  by  making  a  large  turn  or  backing 
up  and  turning.  The  problem  could  also  be  approached  by 
expanding  the  effective  field  of  view  with  additional  cameras. 
This  would  permit  the  robot  to  sense  obstacles  to  the  sides  as 
well  as  straight  ahead. 

Overall,  the  system  displayed  a  high  degree  of  robustness, 
as  would  be  expected  with  a  qualitative  approach.  The  results 
are relatively insensitive to noise in the images. This is evident
from the fact that the procedure works with real images
taken from a robot mounted television camera. In fact, replacing
one or two images in the sequence with other images having
no relation to the scene at hand affects the results hardly
at all. (This was done accidentally during the course of the
experiment.) The internal thresholds used to determine
whether a value represents information or noise were determined
a priori from rough geometric error analysis. Thus no
extensive testing was necessary in order to determine parameter
values. In general, the results seem to be insensitive to the
details  of  the  implementation  as  long  as  the  general  outlines  of 
the  procedure  are  followed.  This  is  an  encouraging  property. 

VI.  CONCLUSIONS 

We  have  argued  that  the  flow  field  divergence  represents 
a  qualitative  measurement  which  can  be  used  as  a  powerful 
cue  in  the  process  of  obstacle  avoidance  during  visual  naviga¬ 
tion.  We  have  further  demonstrated  that  the  flow  field  diver¬ 
gence  can  be  computed  robustly  from  real  image  sequences 
using  relatively  simple  operations.  Two  basic  principles  which 
facilitated  this  task  were  directional  decomposition  of  the 
problem  into  multiple  one-dimensional  domains,  and  temporal 
accumulation  of  information  which  permitted  the  use  of 
many-image  sequences.  Finally,  we  presented  a  practical 
demonstration  in  which  image  flow  divergence  was  used  to 
maneuver  a  vehicle  between  obstacles  in  a  real  world  environ¬ 
ment. 

In  a  more  general  sense,  we  believe  this  work  supports 
two  claims.  First,  that  many  problems  in  vision  can  be  solved 
without  the  sort  of  exact,  quantitative  reconstruction  of  the 
world  which  has  proven  so  difficult  to  obtain.  Obstacle 
avoidance  using  a  qualitative  measure  of  flow  field  divergence 
is  one  example  of  such  a  problem.  Other  simple  examples 
include  visual  stabilization  and  target  tracking.  It  is  also  pos¬ 
sible  that  more  complex  problems  such  as  bin  picking,  and 
high  level  navigation  tasks  such  as  docking,  can  be  performed 
without  reference  to  an  explicit  quantitative  representation  of 
the  world  through  a  direct  association  of  patterns  with  actions. 
We  are  currently  pursuing  research  in  this  area.  Second,  in 
the  study  of  autonomous  systems,  the  development  of  general 
and  robust  solutions  to  specific  tasks  is  at  least  as  important 
as  the  development  of  mathematical  solutions  to  general  prob¬ 
lems  which  hold  under  specific  theoretical  conditions. 
Specifically,  for  many  problems,  practical  solutions  of  the 
latter  kind  may  prove  infeasible  or  at  least  inappropriate, 
because  of  the  difficulty  of  mathematically  characterizing  real 
world  environments  in  a  useful  way.  This  project  has 
approached  the  use  of  motion  sequences  from  the  perspective 
of  the  particular  practical  problem  of  obstacle  avoidance  by  a 
moving  sensor  rather  than  in  the  general,  theoretical  manner 
which  has  characterized  much  of  the  previous  work  on  motion 
and  image  flow. 

This  research  has  applications  in  the  visual  guidance  of 
industrial  robots,  and  in  the  navigation  of  autonomous  aircraft 
or  submarine  vehicles.  The  basic  premise  is  that  once  a  basic 
navigational  procedure  for  the  avoidance  of  obstacles  has  been 
established,  higher  level  directives  (such  as  “try  to  go  north”) 
can  be  imposed  on  the  system  since,  in  most  cases,  consider¬ 
able  freedom  exists  in  choosing  a  path  between  obstacles. 
Abstract


We present an operational perception system for cross-country
navigation which has been verified in both simulated
and real world environments. Range data from a
laser range scanner is transformed into an alternate representation
called the Cartesian Elevation Map (CEM). A
detailed vehicle model operates on the CEM to produce
traversability information along selected trajectories. This
information supports a real-time reflexive planning system.
During the months of July and August, 1987 we successfully
demonstrated our obstacle detection and avoidance algorithms
on board the Autonomous Land Vehicle at Martin
Marietta Aerospace Corporation in Denver. We also propose
enhancements to our current operational system.


1.  INTRODUCTION 

Think for a moment of the perceptual differences between
the indoor world and natural outdoor terrain. A perception
system which extracts features of rectilinear surfaces for
recognition purposes may be well suited for identification of
man-made objects in an indoor environment, but the same
system would fail miserably when faced with an outdoor
scene. The distinguishing features of indoor objects such
as chairs, tables, carpets, and walls have specific meanings
in the indoor environment where they are found. For example,
the vertical edges of a chair would be perceived as an
obstacle to mobility, whereas the color of the carpet may
be a clue to the current location. Attempting to extract
features, identify objects, and build a description of the
environment in order to navigate is a practical approach for
indoor mobile robots since in most cases the untraversable
regions are occupied by vertical obstacles and the floor is
flat. However, meaningful objects in natural terrain are
much more difficult to characterize. Specific features such
as slope, three dimensional edges (e.g. abrupt changes in
vertical height of the terrain), or color of specific regions
may or may not signal obstacles and are difficult to extract
and combine to form reliable descriptions of the terrain. For
example, trees can have widely varying shapes, colors, and
sizes, making tree recognition in itself a formidable problem.

This research was supported in part by the Defense Advanced
Research Projects Agency under contract DACA[...].


With  current  capabilities,  collecting  object  descriptions  in 
such a manner provides, at best, a poor understanding of the
environment  and  is  not  sufficient  for  safe  vehicle  navigation. 

In the context of cross-country navigation, a more satisfactory
definition of obstacles is any region a particular vehicle
cannot traverse. The definition of obstacles depends
upon the mobility characteristics of the vehicle. Therefore,
a natural approach to perception for cross-country
navigation models the vehicle's interaction with a three dimensional
representation of the sensed terrain at locations
of interest to determine traversability. In this paper we
describe such an operational perception system for navigation
in complex outdoor terrain with the Autonomous
Land Vehicle (ALV). The ALV is an eight wheeled vehicle
approximately 5 meters long and 3 meters wide with
a minimum clearance of roughly 22 centimeters (Lowrie
et al. 1985). Three dimensional data is obtained from a
laser range scanner developed by the Environmental Research
Institute of Michigan (ERIM) (Zuk and Dell'Eva
1983) (Sampson 1987). The combination of the mobility
characteristics of the ALV and the ERIM scanner is adequate
to perceive the scale and type of cross-country terrain
shown in Figure 1. We also briefly discuss the hierarchy
for the supporting perception and planning systems and
present results from an actual cross-country experiment at
Martin Marietta Aerospace Corporation (MMAC) in Denver
during August, 1987. Finally, we outline several new
enhancements which have been tested in simulation.

2.  CROSS-COUNTRY  PERCEPTION  SYSTEM 

In our recent experiments, cross-country navigation is carried
out by a hierarchical control system (Daily et al. 1988).
At the highest level of the hierarchy resides a map-based
planner providing goal information obtained from
digital terrain maps to the next lower level in the hierarchy,
the local planner. At the lowest level of the hierarchy is the
reflexive planner, which performs real time control of the
vehicle. The local planner ensures that the reflexive planner
maintains progress toward subgoals. A similarly hierarchical,
multi-layered perception system interacts with each
corresponding level of the planning hierarchy. In this paper,
we are concerned only with the lowest level of the perception
system. In this level, sensing and processing units,
called virtual sensors, detect specific environmental features
and relay information about features to the reflexive
behaviors in the lowest level of the planning system. Virtual
sensors provide information by combining physical sensor
output with appropriate processing algorithms in a manner
transparent to the requesting behavior. In this way,
the vehicle reacts to obstacles in its environment at real
time speeds in a manner consistent with its planned goals.

Figure 1. Typical cross-country terrain for the ALV.

Virtual sensors process data output from physical sensors
on board the vehicle. The ALV is equipped with an ERIM
laser range scanner which measures the distance along the
line of sight to the nearest object. Distance (or “range”)
is actually computed by measuring, at specified angular intervals,
the phase shift between the active and reflected
laser signals. In the range scanner used on the ALV, the
maximum detectable phase shift corresponds to a physical
distance of 64 feet. Additionally, since phase shifts are
measured modulo $2\pi$, all recorded distances are modulo 64
feet. The range resolution of the scanner is 3 inches; it has
a horizontal field of view of 80° and vertical field of view
of 30°. The actual range image is a rectangular array of 64
rows by 256 columns containing 8-bit range values. Figure
2 shows a sample range image with range values encoded
as brightness (i.e. darker values are closer to the sensor).
The scene contains a large, deep ravine with trees on the
left and gently sloping grassy fields on the right. Though
the sensor supplies information which inherently contains
surface geometries, reclamation of surface description is difficult
in complex outdoor imagery. In the following sections
we describe methods for transforming range images into a
simpler representation and the use of this representation
for obstacle detection.

Figure 2. Range image of cross-country terrain.

2.1 CARTESIAN ELEVATION MAP

There  are  numerous  approaches  for  detecting  obstacles 
in  range  imagery  which  perform  calculations  in  the  an¬ 
gle/angle  coordinate  system  of  the  range  sensor  (Tseng  et 
al.  1986)  (Daily,  Harris,  Reiser  1987)  (Hebert  and  Kanade 
1985).  However,  we  have  encountered  several  problems 
with  conventional  image  processing  methods  which  oper¬ 
ate  in  the  plane  of  the  range  image.  For  example,  since  the 
sensor  moves  the  scanning  ray  over  the  environment  at  uni¬ 
formly  incremented  angles,  areas  near  the  sensor  are  much 
more  densely  sampled  than  those  areas  farther  away  from 
the  scanner.  A  small  neighborhood  of  pixels  at  the  top 
of  the  range  image  typically  represents  an  area  in  three- 
space  two  orders  of  magnitude  larger  than  the  same  size 
neighborhood  at  the  bottom  of  the  image.  Consequently, 
convolving  a  range  image  with  a  mask  of  constant  size  is 
equivalent  to  convolving  a  Cartesian  representation  of  the 
same  image  with  a  mask  which  enlarges  and  warps  as  it 
moves  away  from  the  scanner.  Thus,  features  of  uniform 
size  in  the  real  world  cannot  easily  be  detected  with  mask 
operators  of  constant  size  in  range  images.  A  second  diffi¬ 
culty  arises  when  attempting  to  fuse  scans  of  the  same  scene 
taken  from  different  viewing  positions.  Since  range  images 
have  fine  resolution  for  close  objects  and  coarse  resolution 
for  distant  objects,  small  changes  in  viewing  position  create 
large  changes  in  the  scale  and  resolution  of  scene  objects. 
Consequently,  fusion  of  multiple  scans  is  extremely  prob¬ 
lematic  in  the  angle/angle  space  of  range  images. 

The Cartesian Elevation Map (CEM) is an alternate representation
for range information in which data from the
viewer  centered  coordinate  system  of  a  range  sensor  is 
transformed  into  a  Cartesian  z(x,y)  coordinate  system. 
This  results  in  a  down-looking,  map  view  representation 
of  terrain  which  is  useful  for  autonomous  navigation.  This 
same  representation  may  be  obtained  from  other  depth  sen¬ 
sors,  such  as  stereo  or  sonar. 

The first step in converting from laser range data to the
CEM is to calculate the x, y, z Cartesian values from each
range measurement, $\rho(\theta, \phi)$, throughout the range image
using the following transformations:

$x = \rho(\theta, \phi) \sin\theta$
$y = \rho(\theta, \phi) \cos\theta \cos\phi$
$z = \rho(\theta, \phi) \cos\theta \sin\phi$

x, y and z represent the crossrange, downrange, and elevation
coordinates, respectively. In the original range image,
$\phi$ corresponds to the depression angle of a scan ray and $\theta$
indicates the sideways deflection of the ray, measured in the
plane of $\phi$.† Optics in the sensor cause scan rays to diverge
as they travel away from the sensor. When a divergent ray
falls on an object at some arbitrary angle, it illuminates an
elliptically shaped area often referred to as a laser “footprint”.
Distance to the footprint is a reflectance weighted
average over the illuminated area. Each of the 3D points
in the CEM, as derived from the above formulas, denotes
the approximate location of the center of a footprint.

† The coordinate system can be transformed to a true spherical
coordinate system by renaming the angles $\theta$ and $\phi$.
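
As an illustration only, the transformations above can be sketched in Python. The function `range_to_cartesian` follows the three formulas as printed, including their sign convention for z. The helper `pixel_angles` is hypothetical: it assumes that the 64 by 256 range image samples the 30° by 80° field of view at uniform angular steps, and `phi_top` (the depression angle of the first row) is a placeholder value, not a figure from the scanner specification.

```python
import math

def range_to_cartesian(rho, theta_deg, phi_deg):
    """Map one range sample to (x, y, z) = (crossrange, downrange, elevation)."""
    theta = math.radians(theta_deg)  # sideways deflection of the scan ray
    phi = math.radians(phi_deg)      # depression angle of the scan ray
    x = rho * math.sin(theta)
    y = rho * math.cos(theta) * math.cos(phi)
    z = rho * math.cos(theta) * math.sin(phi)
    return x, y, z

def pixel_angles(row, col, rows=64, cols=256, hfov=80.0, vfov=30.0, phi_top=0.0):
    """ASSUMED uniform pixel-to-angle mapping (hypothetical, for illustration)."""
    theta = (col + 0.5) / cols * hfov - hfov / 2.0  # centered on the scanner axis
    phi = phi_top + (row + 0.5) / rows * vfov       # increases toward the bottom row
    return theta, phi

def image_to_points(range_image):
    """Convert a 2D list of ranges (in feet) into a list of (x, y, z) points."""
    rows, cols = len(range_image), len(range_image[0])
    points = []
    for r in range(rows):
        for c in range(cols):
            theta, phi = pixel_angles(r, c, rows, cols)
            points.append(range_to_cartesian(range_image[r][c], theta, phi))
    return points
```

The resulting points are sparse and irregularly spaced in (x, y), which is why an interpolation step is needed before the CEM can be treated as a regular grid.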

Figure  3a  shows  the  actual  (x,  y)  positions  of  each  one  of  the 
scanned  points  for  the  image  in  Figure  2  within  a  64  foot 
by  80  foot  region  in  front  of  the  scanner.  Elevation  data 
(z values) are present only at these sparse points. As one
would  expect,  the  sparsity  of  scanned  points  increases  with 
distance  from  the  scanner.  We  approximate  the  complex 
laser  footprint  scanning  procedure  as  a  smoothing  followed 
by  a  sampling  process.  Theoretically,  if  the  terrain  were 
sampled  well  enough  (i.e.  within  the  Nyquist  rate),  then 
we  could  accurately  reconstruct  the  original  terrain  by  in¬ 
terpolating  a  smooth  surface  between  scanned  points. 

Clearly,  there  will  always  be  some  unknown  regions  in  which 
there  were  not  enough  scanned  points  to  accurately  recon¬ 
struct  a  surface.  For  example,  any  region  outside  the  field 
of  view  of  the  scanner  or  in  the  shadows  of  tall  features 
in  the  terrain  will  be  unknown.  These  areas  must  be  lo¬ 
cated  and  excluded  from  the  interpolation  process.  Figure 
3b  shows  the  regions  where  the  density  of  scanned  points 
was  deemed  too  low  to  accurately  reconstruct  a  surface. 


Figure 3. a) Constraint points b) Known areas c)
Fully  interpolated  CEM. 

An  iterative  interpolation  algorithm  was  used  to  fill  in  a 
continuous  surface  in  all  regions  with  a  sufficiently  dense 
sampling of points. This algorithm is similar to the standard
routines used for interpolating surfaces from sparse
stereo data (Grimson 1981) (Terzopoulos 1983). Rather
than  fitting  a  thin  plate  to  the  surface  (i.e.  minimizing  the 
quadratic  variation),  we  have  found  that  fitting  an  elas¬ 
tic  membrane  is  satisfactory.  Intuitively,  this  algorithm  is 
equivalent  to  fitting  a  rubber  sheet  over  the  set  of  con¬ 
straint  points  and  finding  the  resultant  minimal  energy  sur¬ 
face.  Figure  3c  depicts  the  final  interpolated  CEM,  where 
brighter  areas  denote  higher  elevations.  Figure  4  shows  a 
3D  plot  of  the  same  CEM. 
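
A minimal sketch of membrane interpolation is given below, under stated assumptions. It is not the implementation used on the ALV. It relies on the standard fact that a discrete elastic membrane (minimum sum of squared first differences) is obtained when every free cell equals the average of its neighbors, and it reaches that state by Gauss-Seidel relaxation. The function name `fit_membrane`, the use of `None` for cells without a scanned point, and the `allowed` mask (standing in for the known areas of Figure 3b) are all hypothetical.

```python
def fit_membrane(z, allowed, iterations=500):
    """Fill unknown cells of a sparse elevation grid with a membrane surface.

    z:       2D list; a float at each constraint point, None elsewhere.
    allowed: 2D list of booleans; True where interpolation is permitted.
    Assumes at least one constraint point is present.
    """
    rows, cols = len(z), len(z[0])
    constraints = [v for row in z for v in row if v is not None]
    start = sum(constraints) / len(constraints)  # initial guess for free cells
    surf = [[v if v is not None else (start if allowed[r][c] else None)
             for c, v in enumerate(row)] for r, row in enumerate(z)]
    for _ in range(iterations):
        for r in range(rows):
            for c in range(cols):
                if z[r][c] is not None or not allowed[r][c]:
                    continue  # constraint points and excluded cells stay fixed
                nbrs = [surf[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < rows and 0 <= cc < cols
                        and surf[rr][cc] is not None]
                if nbrs:
                    surf[r][c] = sum(nbrs) / len(nbrs)
    return surf
```

In one dimension the membrane reduces to linear interpolation between constraint points, which gives a simple check of the sketch. A fixed iteration count is used here for brevity; a convergence test on the largest update would be a natural refinement.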


Figure  4.  3D  view  of  CEM. 


We  have  discussed  the  CEM  in  terms  of  the  ERIM  laser 
range scanner, but in fact the CEM can be generated from
any  range  sensor.  For  example,  we  anticipate  using  stereo 
cameras  on  the  ALV  to  produce  range  data  for  CEM  con¬ 
struction.  The  algorithms  we  use  to  process  the  CEM  (de¬ 
scribed  in  the  following  sections)  can  be  used  without  re¬ 
gard  to  the  original  range  sensor.  The  following  section 
discusses  how  the  CEM  concept  is  used  to  fuse  data  from 
multiple  scans.  The  CEM  has  also  been  used  to  aid  in  the 
combination  of  data  from  multiple  sensors  such  as  a  range 
sensor and a color camera (Daily et al. 1987).

2.2  CEM  FUSION 

Due to the limited vertical field of view of the ERIM sensor
(30 degrees), terrain immediately in front of the sensor will
not be visible. For example, the ALV scanner cannot see
obstacles directly in front of the vehicle due to the height
and tilt of the laser range scanner. The closest scanned
ground is approximately 13 feet in front of the vehicle. One
way to fill in the unknown area is to use data from previous
CEMs. Fusion of CEMs requires exact knowledge of
the vehicle displacement in six degrees of freedom between
range images. Orientation sensors on board the ALV record
the vehicle's heading, pitch, roll, x and y values. A simple
estimate of the change in z between two CEMs is calculated
by taking the difference of the average z value of a
small patch of ground in one CEM and the average z value
of the same patch of ground in the other CEM. Provided
the output of the orientation sensors and delta z estimates
are accurate, CEMs can be reliably fused. We are currently
investigating the recovery of the six degree of freedom
motion from sequences of range images.
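
The delta z estimate and the subsequent fill-in can be sketched as follows. This is an illustrative simplification: it assumes that the previous CEM has already been resampled into the grid of the current CEM using the recorded heading, pitch, roll, x and y values, so that only the vertical offset remains. The function names and the use of `None` for unknown cells are hypothetical.

```python
def patch_mean(cem, cells):
    """Average elevation over the given (row, col) cells, skipping unknown cells."""
    vals = [cem[r][c] for r, c in cells if cem[r][c] is not None]
    return sum(vals) / len(vals)

def estimate_delta_z(previous, current, patch):
    """Change in z between two CEMs, from one patch of ground seen in both."""
    return patch_mean(current, patch) - patch_mean(previous, patch)

def fuse(previous, current, patch):
    """Fill unknown cells of the current CEM with offset-corrected older data."""
    dz = estimate_delta_z(previous, current, patch)
    fused = []
    for prev_row, cur_row in zip(previous, current):
        row = []
        for p, c in zip(prev_row, cur_row):
            if c is not None:
                row.append(c)       # newest measurement wins
            elif p is not None:
                row.append(p + dz)  # older data shifted into the current frame
            else:
                row.append(None)    # still unknown
        fused.append(row)
    return fused
```

Preferring the newest measurement wherever both CEMs have data is a design choice made for this sketch; averaging the two would be an alternative when the offset estimate is trusted.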




2.3 USING A VEHICLE MODEL TO FIND OBSTACLES IN THE CEM

The  most  conclusive  way  to  discern  whether  a  certain  patch 
of  terrain  is  an  obstacle  is  to  literally  drive  a  vehicle  over 
the  area  in  question.  In  lieu  of  such  a  time  consuming  and 
potentially  costly  technique,  we  have  developed  a  relatively 
sophisticated  3D  model  of  the  ALV  that  we  apply  to  the 
CEM  with  various  locations  and  headings.  The  interaction 
of  this  vehicle  model  and  the  CEM  yields  a  formal  definition 
of obstacles, thus avoiding ad hoc, incomplete definitions.


Figure 5. Spring model of vehicle. Panels: OK, bad suspension, bad slope, bad clearance.


The  position  and  orientation  of  the  vehicle  at  any  point  in 
the  CEM  is  determined  by  minimizing  the  energy  of  the 
suspension  springs  associated  with  each  wheel  of  the  vehi¬ 
cle.  As  shown  in  two  dimensions  in  Figure  5,  three  types 
of  obstacles  are  detected  with  the  vehicle  model: 

Suspension  obstacle:  detected  when  any  of  the  eight 
wheel  springs  is  stretched  or  compressed  beyond  its  sus¬ 
pension  tolerance. 

Slope obstacle: occurs when the vehicle tips further
than can be safely allowed.

Clearance obstacle: signalled whenever any part of the
terrain juts up into the bottom area of the vehicle carriage.

With  more  processing  time,  the  model  can  be  extended 
to  include  constraints  such  as  weight  distribution,  weather 
conditions,  risk  factors,  vehicle  speed,  and  dynamics.  By 
formally  defining  an  obstacle  in  terms  of  a  particular  ve¬ 
hicle's  characteristics  such  as  clearance  and  suspension,  we 
can  apply  this  model  of  the  vehicle  directly  to  the  sensed 
terrain  to  locate  traversable  regions. 
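
The three obstacle tests can be illustrated with a two dimensional sketch in the spirit of Figure 5. It is a simplification under stated assumptions, not the ALV model. With identical linear springs, minimizing the total spring energy is equivalent to a least-squares line fit through the terrain heights under the wheels, and that equivalence is what the code uses. The 0.22 meter clearance and 15 degree tilt limit come from the text; the wheel positions, the 0.15 meter spring travel, and all function names are hypothetical.

```python
import math

def fit_body_line(xs, hs):
    """Least-squares line h = a + b*x; the minimum-energy pose for equal linear springs."""
    n = len(xs)
    mx, mh = sum(xs) / n, sum(hs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (h - mh) for x, h in zip(xs, hs)) / sxx
    return mh - b * mx, b

def classify_pose(height, wheel_xs, clearance=0.22, max_tilt_deg=15.0,
                  travel=0.15, step=0.1):
    """Return 'OK', 'BAD SUSPENSION', 'BAD SLOPE' or 'BAD CLEARANCE'.

    height:   callable giving terrain height (m) at position x (m).
    wheel_xs: wheel positions along the vehicle (hypothetical layout).
    """
    hs = [height(x) for x in wheel_xs]
    a, b = fit_body_line(wheel_xs, hs)
    # Suspension: a spring stretched or compressed beyond its travel.
    if any(abs(h - (a + b * x)) > travel for x, h in zip(wheel_xs, hs)):
        return "BAD SUSPENSION"
    # Slope: the fitted body line tips further than allowed.
    if abs(math.degrees(math.atan(b))) > max_tilt_deg:
        return "BAD SLOPE"
    # Clearance: terrain between the wheels reaches the carriage bottom.
    lo, hi = min(wheel_xs), max(wheel_xs)
    for i in range(int(round((hi - lo) / step)) + 1):
        x = lo + i * step
        if height(x) > a + b * x + clearance:
            return "BAD CLEARANCE"
    return "OK"
```

For example, `classify_pose(lambda x: 0.0, [0.0, 1.5, 3.0, 4.5])` reports flat ground as traversable. The order in which the three tests are reported is arbitrary in this sketch.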

2.4 THE VEHICLE MODEL TRAJECTORY METHOD

The  vehicle  model  allows  us  to  produce  a  three  dimensional 
configuration  space  by  applying  the  model  at  each  possible 
position  and  heading.  In  most  cases  a  complete  traversabil- 
ity  map  of  the  entire  sensed  area  is  not  needed.  Because  of 
the  processing  time  required  for  the  construction  of  such  a 
traversability  map,  we  have  developed  a  technique  for  ap¬ 
plying  the  model  only  at  those  places  necessary  to  fulfill 
requests  issued  from  the  planner.  In  the  context  of  virtual 
sensors (briefly mentioned in Section 2), we use the vehicle
model to produce specific information for traversability
at fast update rates. This method, called the Vehicle
Model Trajectory (VMT) virtual sensor, simply calculates
the  projected  heading  of  the  vehicle  at  each  point  along 
a  linear  trajectory  and  applies  the  model  at  that  heading 
and  location  until  it  either  reaches  the  end  of  the  trajectory 
or  assumes  an  unstable  configuration.  The  distance  trav¬ 
elled  (or  “safe  distance”)  and  the  reason  for  stopping  are 
returned.  Figure  6  shows  the  result  of  applying  the  vehicle 
model  at  seven  linear  trajectories  in  the  CEM. 
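
The VMT procedure can be sketched as a simple march along a straight line. The code below is illustrative only. The callable `check_pose` stands in for the vehicle model and is hypothetical, as are the heading convention (0 degrees is straight ahead along the downrange axis, positive to the right), the step size, and the fan of seven headings in the usage note.

```python
import math

def vmt_safe_distance(check_pose, start, heading_deg, length, step=1.0):
    """Walk a linear trajectory and return (safe_distance, reason_for_stopping).

    check_pose: callable (x, y, heading_deg) -> 'OK' or an obstacle label.
    start:      (x, y) position of the vehicle in the CEM.
    """
    h = math.radians(heading_deg)
    travelled = 0.0
    for i in range(1, int(length / step) + 1):
        d = i * step
        x = start[0] + d * math.sin(h)  # crossrange
        y = start[1] + d * math.cos(h)  # downrange
        status = check_pose(x, y, heading_deg)
        if status != "OK":
            return travelled, status    # last distance known to be safe
        travelled = d
    return travelled, "END OF TRAJECTORY"
```

A fan of trajectories is then a list comprehension such as `[vmt_safe_distance(check_pose, (0.0, 0.0), h, 30.0) for h in (-30, -20, -10, 0, 10, 20, 30)]`, where the seven headings shown are placeholders rather than the values used on the ALV.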

For  use  on  the  ALV,  we  implemented  four  versions  of  the 
Vehicle  Model  Trajectory  virtual  sensor,  each  providing  a 
different  level  of  accuracy  and  speed.  Three  parameters 
may  be  varied  independently:  CEM  size  and  resolution,  ve¬ 
hicle  model  accuracy,  and  sampling  frequency  of  the  range 
image.  In  applying  the  vehicle  model,  computation  time 
can  be  significantly  improved  by  processing  every  other 
pixel  along  a  given  trajectory  and  performing  the  clear¬ 
ance  test  at  a  lower  resolution.  In  our  recent  experiments 
(described  in  the  next  Section)  we  typically  used  a  version 
with  one  foot  per  pixel  resolution.  This  version  used  every 
other  pixel  while  applying  the  vehicle  model  in  the  CEM. 


Figure  6.  Vehicle  Model  Trajectory  virtual  sensor 
applied  to  CEM. 


The  total  processing  time  from  image  acquisition  to  trajec¬ 
tory  output  was  approximately  seven  seconds  on  a  Sun  3 
with  a  floating  point  coprocessor. 

2.5  EXPERIMENTS  WITH  THE  ALV 

During the months of July and August, 1987 we successfully
demonstrated our obstacle detection and avoidance algorithms
on board the Autonomous Land Vehicle at MMAC.
The  perception  system  consisted  of  the  CEM  and  VMT  vir¬ 
tual  sensor  code  running  on  a  Sun  3  resident  on  the  ALV 
producing  the  maximum  safe  distance  along  seven  linear 
trajectories.  This  information  was  used  by  a  corresponding 
planning  system  to  avoid  obstacles.  A  map  based  planner 
provided  start,  subgoal,  and  goal  points  while  a  local  plan¬ 
ner  monitored  progress  of  the  vehicle  toward  each  subgoal. 
All  planning  modules  resided  on  an  off  board  Lisp  machine 
which  communicated  with  the  perception  system  and  vehi¬ 
cle  control  systems  over  a  radio  link. 

For these experiments, we chose a site rich with obstacles
and  potential  paths  for  obstacle  avoidance.  The  area  was  a 
hillside  containing  steep  slopes  (some  over  15  degrees),  rock 
outcrops,  large  scrub  oaks,  and  small  junipers  (25  inches 
high),  as  well  as  a  narrow  sign  post.  In  addition,  the  area 
included  complex  ravines,  caused  by  rain  runoff,  spanning 
a range of widths from 6 to 24 inches and depths from 4 to
30  inches.  The  soil  was  loose  and  sandy  in  the  vicinity  of 
the  ravines,  as  well  as  on  portions  of  the  hillside.  The  ALV 
can  traverse  slopes  with  vehicle  pitch  and  roll  of  less  than 
15  degrees  making  the  hill  generally  navigable  except  for 
outstanding  terrain  features.  The  remaining  surface  con¬ 
sisted  of  grass  which  was  mowed  to  be  less  than  9  inches 
tall,  as  required  by  fire  safety  codes.  The  difficulties  as¬ 
sociated  with  this  cross-country  terrain  were  sufficient  to 
require  caution  by  a  human  driver. 

Figure 7 shows a sequence of nine range images and the
corresponding CEM and vehicle model trajectories. The
ALV was running completely autonomously using a simple
on  board  behavior  which  chose  the  longest  trajectory  bi¬ 
ased  toward  a  goal  heading.  The  range  images  were  ac¬ 
quired  from  the  ERIM  laser  scanner  at  approximately  8 
second  intervals  with  the  ALV  moving  at  0.5  meters  per 
second.  The  terrain  is  steeply  sloped  in  places  (close  to 
15  degrees)  which  causes  the  vehicle  and  scanner  to  tilt 
forward, shortening the effective visibility. A large rock
outcrop  surrounded  by  trees  is  visible  on  the  left  for  the 
duration  of  the  sequence.  The  small  object  visible  in  the 
second  through  fourth  frames  is  a  juniper  bush  about  three 
feet tall, which was adequately avoided by the on board
behaviors  controlling  the  vehicle. 

3.  ENHANCEMENTS 

As described in Section 2, the vision system currently calculates
the clear distance along each trajectory (eight bits
of information per trajectory). Since seven trajectories are
computed, a total of 56 bits of information are used by the
planner for each range scan. Though the system works remarkably
well, we have noticed several problems due to this
paucity of data flow between the perception and planning
modules. The planner has no indication of how dangerous
a clear trajectory really is. For example, the planner may
unknowingly choose fairly steep (i.e. near 15 degrees) paths
or trajectories that pass very close to obstacles. These potentially
dangerous paths are technically not obstacles, but
there may be safer routes available. We have designed a
gradient technique for obstacle detection and a downlooking
Cartesian map of costs for vehicle traversability called
the Cartesian Weight Map (CWM) in order to extract and
combine more information from the range data. With this
added information, the planner will navigate more intelligently.

3.1  VEHICLE  MODEL  IMPROVEMENTS 

In Section 2.4, we discussed the application of the vehicle
model using seven linear trajectories at predefined headings
in the CEM. However, the planner controls the vehicle
through speed and turn rate commands which result in
curved trajectories. Furthermore, subsequent to the calculation
of the CEM from a particular range image, the
vehicle model test was performed beginning at the vehicle's
location when the range image was acquired. Figure
8 shows a CEM with several curved trajectories generated
using speed and turn rate and applied at the location of the
vehicle when vehicle model processing began. These capabilities
are important since eventually we hope to be able to
compute the CEM asynchronously at frame rates (one half
second for the ERIM scanner) and run the vehicle model
on the most current available CEM. Along these lines, we
have developed a virtual sensor which merely checks the
current path of the vehicle as fast as possible and returns
the safe distance available along that path to the planner.
The CEM is calculated asynchronously in a separate process
on the Sun and passed to the virtual sensor which constantly
checks the current path on the most recent CEM,
updating the vehicle position as it moves. In this way, the
planner is never moving the vehicle blindly along its current
direction for more than a fraction of a second.

Figure 7. Sequence of range images and CEMs
with vehicle model trajectories, in order from left
to right and top to bottom. On the left is a large
rock outcrop with tall trees, while a small juniper
is visible in frames two through four. Vehicle pitch
and roll vary from 5 to 15 degrees.


3.2 GRADIENT OF GAUSSIAN (GOG) OBSTACLE DETECTION

Our current VMT strategy simply tells the planner how far
the vehicle may go along a given trajectory. The planner
has no idea how close the trajectory passes by an obstacle.
In simulation and in real life the vehicle tends to pass too
close to obstacles since no part of our current system deals
with side clearance. Using a wider vehicle in the vehicle
model test would not be a satisfactory solution since this
would change the definition of all kinds of obstacles. Also,
a wider vehicle model would never allow the actual vehicle
to pass through a narrow opening. We could ease this
problem by searching additional trajectories on either side
of the chosen trajectory. As mentioned above, curved trajectories
should be searched. Rather than expand our system
to search the large set of possible curved trajectories
with the computationally expensive vehicle model, we have
developed the Gradient of Gaussian (GoG) obstacle detection
technique. GoG returns an approximate ternary map
which marks traversable, nontraversable, and unknown areas.
This representation can be used to quickly generate a
few trajectories of interest, which are then verified by the
vehicle model.

GoG obstacle detection in the CEM is analogous to edge detection
in video images. Edge detection in images typically
requires smoothing and differentiation. The GoG method
first smooths the CEM with an appropriately sized Gaussian,
and then thresholds the magnitude of the gradient.
We choose the threshold value to be 15 degrees, the maximum
slope that the ALV can safely traverse in any direction.
The standard deviation of the Gaussian must be large
enough to smooth out high frequency objects that are not
obstacles (e.g. small things like cans) and yet allow major
terrain features which are obstacles to remain. We have
empirically determined that a standard deviation of three
feet yields an obstacle detection performance similar to that
of our vehicle model. An application of the GoG technique
to the example CEM is shown in Figure 9.
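
A minimal sketch of the GoG computation follows, assuming a fully interpolated CEM stored as a 2D list of elevations in feet on a one foot grid. It smooths with a separable Gaussian, takes central differences, and marks cells whose slope angle exceeds 15 degrees, which is the same as a gradient magnitude above tan(15°). The handling of unknown areas is omitted, and the border treatment (edge values repeated), the kernel radius of three standard deviations, and all function names are choices made for this illustration.

```python
import math

def gaussian_kernel(sigma_cells):
    """Normalized 1D Gaussian truncated at three standard deviations."""
    radius = max(1, int(3 * sigma_cells + 0.5))
    k = [math.exp(-0.5 * (i / sigma_cells) ** 2) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]

def smooth(cem, sigma_cells):
    """Separable Gaussian smoothing; border cells are repeated outward."""
    k = gaussian_kernel(sigma_cells)
    r = len(k) // 2
    rows, cols = len(cem), len(cem[0])
    clamp = lambda i, n: min(max(i, 0), n - 1)
    tmp = [[sum(k[j + r] * cem[y][clamp(x + j, cols)] for j in range(-r, r + 1))
            for x in range(cols)] for y in range(rows)]
    return [[sum(k[j + r] * tmp[clamp(y + j, rows)][x] for j in range(-r, r + 1))
             for x in range(cols)] for y in range(rows)]

def gog_obstacles(cem, cell_size=1.0, sigma=3.0, max_slope_deg=15.0):
    """Return a 2D list of booleans: True where the smoothed slope is too steep."""
    s = smooth(cem, sigma / cell_size)
    limit = math.tan(math.radians(max_slope_deg))
    rows, cols = len(s), len(s[0])
    out = []
    for y in range(rows):
        row = []
        for x in range(cols):
            x0, x1 = max(x - 1, 0), min(x + 1, cols - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, rows - 1)
            gx = (s[y][x1] - s[y][x0]) / (max(x1 - x0, 1) * cell_size)
            gy = (s[y1][x] - s[y0][x]) / (max(y1 - y0, 1) * cell_size)
            row.append(math.hypot(gx, gy) > limit)
        out.append(row)
    return out
```

On a uniform ramp the smoothing leaves the interior slope unchanged, so a ramp rising half a foot per foot is flagged while flat ground is not.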


Figure  9.  Gradient  of  Gaussian  on  CEM  from  Fig¬ 
ure  3. 

Since the GoG technique is a gross simplification of the
vehicle  model,  we  require  that  potential  paths  and  the  final 
chosen  path  be  checked  by  the  vehicle  model.  Furthermore, 
edge  detection  techniques,  like  GoG,  have  difficulties  with 
the  localization  of  features.  That  is,  when  an  image  (or 
CEM)  is  smoothed,  the  location  of  the  real  edge  (or  obsta¬ 
cle)  is  difficult  to  determine  without  additional  processing. 
This  smearing  of  obstacles  is  usually  not  a  problem  since  the 
vehicle  normally  stays  far  from  obstacles.  However,  when 
the  vehicle  must  pass  through  narrow  areas  where  accurate 
localization  is  necessary,  the  constrained  space  is  searched 
with  the  vehicle  model  to  find  clear  pathways. 

3.3  CARTESIAN  WEIGHT  MAP 

We  originally  developed  the  Cartesian  Weight  Map  (CWM) 
to  prevent  the  vehicle  from  passing  too  close  to  obstacles. 
Like the CEM, the CWM is a down-looking map in the local
coordinate system of the vehicle. A single pixel value in the CWM
contains a weight identifying the cost of traversing the
corresponding pixel in the CEM. Obstacles found using GoG are given
an infinite weight in the CWM so that no path will ever travel
through an obstacle. To solve the problem of the vehicle straying
too close to obstacles, all other CWM pixels are given weights
which exponentially decay with distance from the nearest obstacle.
These weights are set by using a modified medial axis transform (or
skeletonization) operation that consists of local propagation among
neighboring pixels. A dynamic programming algorithm (Dijkstra's) is
used to find the least cost path from the current vehicle position
in the CWM to every other pixel in the CWM. We have simulated a
suggested path for our example CEM as shown in Figure 10. We have
tested the use of the CWM in a sophisticated simulation environment
and found that the CWM technique safely guides the vehicle
equidistant between obstacles and avoids small cul-de-sacs.
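
A minimal Python sketch of the idea follows. It assumes a breadth-first propagation of city-block distance as a stand-in for the modified medial axis transform, and the constants base, peak, and decay are hypothetical; only the overall scheme (infinite weight on obstacles, exponentially decaying weight elsewhere, Dijkstra's algorithm from the vehicle position) is taken from the text.

```python
import heapq
import math
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def obstacle_distance(obstacle):
    """City-block distance to the nearest obstacle, by local propagation."""
    rows, cols = len(obstacle), len(obstacle[0])
    dist = [[math.inf] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if obstacle[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in NEIGHBORS:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and dist[rr][cc] == math.inf:
                dist[rr][cc] = dist[r][c] + 1
                queue.append((rr, cc))
    return dist


def build_cwm(obstacle, base=1.0, peak=100.0, decay=0.5):
    """Infinite weight on obstacles; elsewhere base + peak * exp(-decay * d)."""
    dist = obstacle_distance(obstacle)
    return [[math.inf if obstacle[r][c]
             else base + peak * math.exp(-decay * dist[r][c])
             for c in range(len(obstacle[0]))]
            for r in range(len(obstacle))]


def least_cost_paths(cwm, start):
    """Dijkstra's algorithm; the cost of a step is the weight of the cell entered."""
    rows, cols = len(cwm), len(cwm[0])
    cost = [[math.inf] * cols for _ in range(rows)]
    prev = {}
    cost[start[0]][start[1]] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > cost[r][c]:
            continue  # stale heap entry
        for dr, dc in NEIGHBORS:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and cwm[rr][cc] != math.inf:
                nd = d + cwm[rr][cc]
                if nd < cost[rr][cc]:
                    cost[rr][cc] = nd
                    prev[(rr, cc)] = (r, c)
                    heapq.heappush(heap, (nd, (rr, cc)))
    return cost, prev


def path_to(prev, start, goal):
    """Walk the predecessor links back from goal (raises KeyError if unreachable)."""
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

With two obstacles on opposite sides of a corridor, the decaying weights make the row midway between them the cheapest, so the recovered path runs down the middle.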


Figure  10.  Least  cost  path  on  CEM  from  Figure  3. 

So  far  we  have  discussed  the  use  of  the  CWM  in  terms  of 
a  single  virtual  sensor,  nearest  obstacle  distance.  However, 
the  CWM  provides  a  general  framework  for  the  combina¬ 
tion  of  various  virtual  sensors  that  deal  with  navigation. 
For  instance,  one  virtual  sensor  could  penalize  some  areas 
in  the  CWM  which  are  bumpy  and  reward  the  regions  that 
are  smooth.  Perhaps  another  virtual  sensor  could  update 
the  CWM  based  upon  the  slope  of  the  ground.  This  sys¬ 
tem  of  multiple  virtual  sensors  concurrently  updating  the 
CWM  (shown  in  Figure  11)  is  consistent  with  our  hierarchi¬ 
cal  planning  system.  To  combine  slope  and  nearest  obstacle 
distance  into  a  single  weight  we  have  to  answer  questions 
such  as,  “Would  you  rather  take  a  path  that  passes  within 
2  feet  of  an  obstacle  or  take  a  path  that  rolls  the  vehicle  10 
degrees?”  The  local  planner  is  best  able  to  deal  with  ques¬ 
tions  of  this  type.  The  local  planner  may  decide  to  use  a 
combination  function  that  fuses  the  different  types  of  infor¬ 
mation  into  one  weight  at  each  pixel.  This  combiner  func¬ 
tion  would  be  a  nonlinear  combination  of  the  outputs  from 
different virtual sensors. The combiner function should not
be fixed, but will change with various environments, speeds,
and  mission  objectives.  As  before,  a  standard  dynamic  pro¬ 
gramming  algorithm  can  be  used  to  find  the  minimum  cost 
path  through  the  CWM. 
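
One possible shape for such a combiner is sketched below in Python. The text does not specify the function, so the power-law form, the exponent, the sensor names, and the weight profiles are all assumptions chosen only to show how a local planner might swap trade-offs between sensors.

```python
import math


def combine(sensor_maps, weights, exponent=2.0):
    """Fuse per-sensor penalty grids into a single CWM.

    sensor_maps: dict of name -> 2-D grid of penalties (0 = no penalty).
    weights:     dict of name -> importance chosen by the local planner.
    An infinite penalty from any sensor keeps the cell impassable.
    """
    names = list(sensor_maps)
    first = sensor_maps[names[0]]
    rows, cols = len(first), len(first[0])
    cwm = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [sensor_maps[n][r][c] for n in names]
            if any(math.isinf(v) for v in vals):
                cwm[r][c] = math.inf
            else:
                # nonlinear: large penalties dominate small ones
                cwm[r][c] = 1.0 + sum(weights[n] * v ** exponent
                                      for n, v in zip(names, vals))
    return cwm


# Hypothetical planner profiles: one fears obstacles, one fears slope.
CAUTIOUS_NEAR_OBSTACLES = {"obstacle": 10.0, "slope": 1.0}
CAUTIOUS_ON_SLOPES = {"obstacle": 1.0, "slope": 10.0}
```

Changing the profile changes the answer to the obstacle-versus-roll question without touching the virtual sensors or the path search.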


Figure  11.  Updating  the  Cartesian  Weight  Map. 


3.4  PARALLELISM 

A  great  advantage  of  the  CEM  and  CWM  systems  is  that 
they  are  extremely  parallel.  In  cooperation  with  MIT  in 
March,  1987,  we  implemented  the  CEM  construction  algo¬ 
rithm  on  the  Connection  Machine  (Hillis  1985).  Creating 
a  128  x  128  (6  inch  resolution)  CEM  from  a  range  image 
took  less  than  0.5  seconds  on  the  CM-1.  The  unoptimized 
Lisp  code  was  floating  point  intensive,  so  we  expect  a  ma¬ 
jor  speed  increase  when  we  have  access  to  a  CM-2  (the 
next  generation  Connection  Machine  with  special  floating 
point  chips  and  other  enhancements).  During  our  two  week 
programming  spree  on  the  Connection  Machine,  we  had  ex¬ 
tra  time  to  develop  the  GoG  obstacle  detection  technique 
which  requires  only  an  additional  8  milliseconds  of  comput¬ 
ing  time. 

We  believe  that  the  virtual  sensors  for  finding  distance  from 
obstacles,  bumpiness  of  terrain,  and  steepness  will  also  op¬ 
erate  on  the  order  of  milliseconds  on  the  Connection  Ma¬ 
chine.  If  each  pixel  in  the  CWM  is  allocated  a  processor, 
Dijkstra's  algorithm  is  also  extremely  fast.  Our  conserva¬ 
tive  estimate  is  that  256  iterations  (longest  path  length)  of 
this  algorithm  using  integer  weights  will  take  less  than  200 
ms  on  the  Connection  Machine. 
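
The per-pixel formulation can be pictured as a synchronous relaxation in which every pixel repeatedly takes the cheapest cost offered by its neighbors. The Python sketch below simulates that serially; it is an interpretation of the one-processor-per-pixel scheme, not the Connection Machine code, and the function name and the 256-sweep cap (taken from the estimate above) are illustrative.

```python
INF = float("inf")
NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def parallel_relax(cwm, start, max_iters=256):
    """Simulate one-processor-per-pixel relaxation of path costs.

    Each sweep reads only the previous sweep's costs, so all pixels
    could update simultaneously. Returns the cost grid and the number
    of sweeps that changed at least one pixel.
    """
    rows, cols = len(cwm), len(cwm[0])
    cost = [[INF] * cols for _ in range(rows)]
    cost[start[0]][start[1]] = 0
    for sweep in range(max_iters):
        new = [row[:] for row in cost]
        changed = False
        for r in range(rows):
            for c in range(cols):
                if cwm[r][c] == INF:
                    continue  # obstacle pixel never receives a cost
                best = min((cost[r + dr][c + dc]
                            for dr, dc in NEIGHBORS
                            if 0 <= r + dr < rows and 0 <= c + dc < cols),
                           default=INF)
                candidate = best + cwm[r][c]
                if candidate < new[r][c]:
                    new[r][c] = candidate
                    changed = True
        cost = new
        if not changed:
            return cost, sweep
    return cost, max_iters
```

The number of productive sweeps is bounded by the number of pixels on the longest least-cost path, which is why the iteration count above is tied to path length.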

4.  CONCLUSIONS  AND  FUTURE  PLANS 

We  have  presented  an  operational  perception  system  for 
cross-country  navigation  which  has  been  verified  in  both 
simulated  and  real  world  environments.  Range  image  data 
is  transformed  into  an  alternate  representation  called  the 
Cartesian  Elevation  Map.  Virtual  sensors,  including  those 
which  use  a  detailed  vehicle  model,  operate  on  the  CEM  to 
produce traversability information along selected trajectories.
This information supports a real-time reflexive planning
system. Enhancements to the current implementation
of  the  perception  system  include  curved  trajectories  for  the 
vehicle  model  and  the  Gradient  of  Gaussian  obstacle  detec¬ 
tion  and  Cartesian  Weight  Map  techniques. 




Our  immediate  plans  include  further  testing  during  addi¬ 
tional  experiments  on  the  ALV  in  November,  1987.  We  are 
currently  porting  our  algorithms  to  the  WARP  multiproces¬ 
sor  which  will  be  available  on  the  ALV  in  the  near  future. 
We  also  hope  to  incorporate  the  CWM  into  the  perception 
system,  as  well  as  develop  new  virtual  sensors  which  use 
range  and  color  imagery.  Higher  levels  in  the  perception 
system  will  also  be  developed  to  use  the  output  of  virtual 
sensors  for  object  recognition. 
Abstract


The  SRI  Cartographic  Modeling  Environment  has  been  created 
to  support  research  on  interactive,  semiautomated,  and  auto¬ 
mated  computer-based  cartographic  activities.  The  underly¬ 
ing  image  manipulation  capabilities  are  provided  by  the  SRI 
ImagCalc™  system.  The  cartographic  features  and  data  that 
can  be  entered  include  multiple  images,  camera  models,  dig¬ 
ital  terrain  elevation  data,  point,  line,  and  area  cartographic 
features,  and  a  wide  assortment  of  three-dimensional  objects. 
Interactive  capabilities  include  free-hand  feature  entry,  alter¬ 
ing  features  while  constraining  them  to  conform  to  the  terrain 
and  lighting  geometry,  adjustment  of  feature  parameters,  and 
the adjustment of the camera model to display the scene features
from arbitrary viewpoints. Cartographic features are depictable
either as wire-frame sketches for interactive purposes or
as  texture-mapped  renderings  for  realistic  scene  synthesis.  High- 
quality  simulated  scenes  are  created  by  texture-mapping  images 
onto  terrain  data  and  adding  renderings  of  cartographic  features 
using  depth-buffering  and  anti-aliasing  techniques.  Motion  se¬ 
quences  can  be  created  by  choosing  a  series  of  camera  models 
and  rendering  the  simulated  appearance  of  the  scene  from  each 
viewpoint. 


1  Introduction 


Among  the  cartographic  applications  upon  which  computer- 
based  techniques  can  have  a  substantial  impact  are  the  following: 


•  Cartographic  compilation.  Manual  cartographic  compi¬ 
lation  is  expensive  and  traditional  techniques  limit  the  ways 
in  which  constraints  and  previously-compiled  information 
can be incorporated into the process. The task of compilation
requires the construction of world-registered three-dimensional
models from a body of sensor data and geometric information
sources. Such information is now used not only for the creation of
paper-based cartographic products, but also for the establishment
of digital databases that are used for two-dimensional cartographic
depictions and three-dimensional tactical simulations. Without
automation, the
increasing  demands  for  timely,  detailed  cartographic  infor¬ 
mation  will  vastly  exceed  the  resources  available. 


•  End-user consumption of cartographic products. A
growing number of end-users need to interact directly with
digital data, without requiring the use of paper media.


*This research was supported by the Defense Advanced Research Projects
Agency under Contract No. MDA903-86-C-0084. Some aspects of this
research were also supported by National Science Foundation Grant No.
IST-8511751.


These  same  end-users  must  be  able  to  select,  customize, 
and  update  the  distributed  data  to  meet  rapidly  changing 
situations  or  unforeseen  requirements.  These  needs  have 
led to the concept of the "Deployable Digital Database,"
in  which  digital  media  replace  paper  media  for  information 
storage  and  retrieval. 


•  Generation of mission-planning and training scenarios.
Mission planning and training can be greatly enhanced
by  the  ability  to  visualize  spatial  environments.  This  can 
be  accomplished  within  a  computer  graphics  system  by  gen¬ 
erating  synthetic  static  views  and  motion  sequences  incor¬ 
porating  detailed  cartographic  feature  models  and  texture 
maps  of  real  imagery  projected  onto  the  features  and  ter¬ 
rain. 


The  SRI  Cartographic  Modeling  Environment  has  been  de¬ 
veloped  as  a  sophisticated  experimental  research  tool  for  explor¬ 
ing  three-dimensional  modeling  tasks.  While  the  system  has 
many  capabilities  in  common  with  other  Computer  Aided  De¬ 
sign  (CAD)  systems,  it  is  particularly  well-suited  for  evaluat¬ 
ing  computer-based  approaches  to  the  cartographic  applications 
noted  above.  It  supports  a  variety  of  interactive  facilities  for  cre¬ 
ating,  editing,  viewing,  and  rendering  three-dimensional  models 
of  world  objects,  cartographic  features,  and  terrain.  The  mod¬ 
eling  capabilities  may  be  used  independently  or  in  conjunction 
with  multiple,  metrically  calibrated  digital  images.  Interaction 
with  geometric  models  is  characterized  by  intuitive  simplicity 
and  by  innovative  techniques  for  exploiting  geometric  and  data- 
driven  constraints  in  the  manipulation  process.  Synthetic  views 
of  a  scene  may  be  constructed  from  arbitrary  viewpoints  using 
terrain  and  feature  models  in  combination  with  texture  maps  ac¬ 
quired  from  aerial  imagery.  Figure  1  gives  a  schematic  overview 
of  the  system’s  organization  and  capabilities. 

This  system  extends  the  two-dimensional  capabilities  of  the 
ImagCalc™ image manipulation system to three-dimensional
space. The geometric parameters of cartographic features and
camera models are defined in a three-dimensional world coordinate
system. Each image has an associated camera model that defines a
ray in space for each pixel in the image. These camera models are
used to project world coordinates to image coordinates; when a
world coordinate corresponding to a point in the image is required,
the camera models also serve to project a ray from the image to the
world, so that its intersection with terrain and object models can
be found.
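
The two directions of a camera model can be sketched with an idealized pinhole camera. The Python below is illustrative only: the Camera class, its fields, and intersect_plane are hypothetical names, lens distortion and film digitization are ignored, and a horizontal plane stands in for the terrain and object models.

```python
from dataclasses import dataclass


@dataclass
class Camera:
    center: tuple    # camera position in world coordinates
    rotation: tuple  # 3x3 world-to-camera rotation, given as three rows
    focal: float     # focal length, in pixels
    cx: float        # principal point, column
    cy: float        # principal point, row

    def project(self, p):
        """World point -> image coordinates (u, v)."""
        d = [p[i] - self.center[i] for i in range(3)]
        x, y, z = (sum(row[i] * d[i] for i in range(3))
                   for row in self.rotation)
        if z <= 0:
            raise ValueError("point is behind the camera")
        return (self.focal * x / z + self.cx, self.focal * y / z + self.cy)

    def pixel_ray(self, u, v):
        """Image point -> (origin, direction) of the viewing ray in the world."""
        dc = ((u - self.cx) / self.focal, (v - self.cy) / self.focal, 1.0)
        # The transpose of the rotation maps camera axes back to world axes.
        dw = tuple(sum(self.rotation[k][i] * dc[k] for k in range(3))
                   for i in range(3))
        return self.center, dw


def intersect_plane(origin, direction, height):
    """Intersect a ray with the horizontal plane z = height (None if no hit)."""
    if direction[2] == 0:
        return None
    t = (height - origin[2]) / direction[2]
    if t < 0:
        return None
    return tuple(origin[i] + t * direction[i] for i in range(3))
```

Projecting a ground point into the image and then intersecting the ray of the resulting pixel with the ground plane should return the original point, which gives a simple consistency check on a camera model.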

Among  the  potential  applications  of  the  system  we  note  the 
following: 


•  Studying human interface techniques for enhancing productivity
in the computer environment.
•  Developing improved specifications for future, more specialized
application environments.

•  Serving  as  a  flexible  background  environment  in  which 
to  develop  and  test  semiautomated  and  automated  carto¬ 
graphic  analysis  techniques. 

•  Performing  one-of-a-kind  tasks  such  as  creating  a  unique 
set  of  models  to  be  used  in  a  simulated  image,  or  making  a 
demonstration  videotape  of  a  simulated  mission  scenario. 

The  distinguishing  features  that  set  the  Cartographic  Mod¬ 
eling  Environment  apart  from  more  conventional  CAD  systems 
include: 

•  Registration of multiple data sources, including stereographic
or multiple images, terrain elevation models, and
three-dimensional  object  models,  to  the  same  world  coor¬ 
dinate  system.  This  capability  is  unique  in  that  it  permits 
object  model  entry  to  be  driven  by  sensor  data  such  as  ac¬ 
tual  images. 

•  Use  of  lighting  models,  terrain  elevation  data,  and  other 
geometric  knowledge  to  constrain  and  facilitate  data  entry. 
The  exploitation  of  constraints  in  the  interactive  modeling 
process  potentially  increases  the  efficiency  of  the  human 
operator. 

•  Registration of local coordinate systems to UTM,
latitude-longitude, and other cartographic coordinate
representations. The use of real-world coordinate systems enables
the system to exploit specific world knowledge, e.g., by computing
the sun position for a particular location at a particular
time of day.


2.1  Data  Entry  Modes 

The  system  supports  three  major  approaches  to  the  entry  of 
cartographic  data: 

•  Manual. A human operator can directly interact with the
computer system in many ways, ranging from viewing a library of
images and sketching cartographic features to creating synthetic
scene views. In this mode, the user typically adjusts the position
and parameters of a feature using direct visual feedback from the
underlying image data sources.

•  Semiautomated.  The  computer  has  the  ability  to  per¬ 
form  certain  operations  to  enhance  the  human  user's  effec¬ 
tiveness  substantially  while  remaining  under  the  human's 
direct,  interactive  control.  Examples  of  such  operations  are 
the  display  of  registered  world  points  on  multiple  images, 
no  two  of  which  need  be  fusible  as  stereographic  pairs,  the 
display  of  illumination  rays  intersecting  any  world  point, 
and  running  local  feature-extraction  operators  that  depend 
upon  being  given  a  reasonable  starting  cue. 

•  Fully  Automated.  Very  few  cartographic  compilation 
procedures  can  be  fully  automated  using  present  technol¬ 
ogy.  However,  as  automated  techniques  mature,  it  is  ap¬ 
parent  that  they  will  need  to  coexist  with  the  manual  and 
semiautomated  compilation  methods.  Furthermore,  even 
fully  automated  techniques  need  an  environment  in  which 
the  human  initiator  of  the  analysis  process  can  interact  with 
and  select  the  domains  and  goals  of  automated  operation. 
Thus  we  see  our  system  as  a  natural  framework  in  which 
to  perform  the  staged  transition  from  manual  to  automated 
computer-based  cartographic  analysis. 
The remainder of this document is organized as follows: Section
2 describes the basic facilities that are available in the system
for the interactive user; Section 3 gives an overview of the
internal structures of the system that would be used by a
programmer to build applications within the environment; Section
4 outlines the hardware and software characteristics of the
current implementation. The final section concludes with a brief
summary.


2  Interacting  with  the  System 
The Cartographic Modeling Environment can be used in two
very different ways. This section is devoted to a description
of the interactive tasks that can be carried out on the system
with little or no reference to the programming environment; no
significant knowledge of the computer system itself is required,
since all operations are carried out by pointing to the graphics
display  with  the  mouse  and  performing  simple  menu  operations 
with  the  mouse  and/or  keyboard.  In  the  subsequent  section,  we 
describe  the  programming  interface  to  the  system  for  those  who 
need  to  go  beyond  the  facilities  provided  in  the  interactive  user 
interface. 

In  this  Section,  we  first  discuss  the  data  entry  modes  and  car¬ 
tographic  display  capabilities  that  are  supported.  Next,  we  de¬ 
scribe  the  nature  of  the  user  interface,  the  types  of  data  sources 
used,  and  the  components  of  an  interactive  scene  view.  Finally, 
we  give  an  outline  of  some  of  the  interactive  feature  manipula¬ 
tion  activities  possible  within  the  system. 


2.2  End-User  Data  Display  Capabilities 

Facilities  similar  to  those  required  by  end-users  are  supported 
as  well.  Among  the  specific  viewer-oriented  capabilities  are: 

•  Data  Selection.  The  user  can  choose  a  customized  subset 
of  the  features  available,  thereby  uncluttering  the  display 
to  make  his  task  more  efficient. 

•  View  Manipulation.  A  data  set  can  be  viewed  using  an 
arbitrary  camera  model,  whose  position,  orientation,  and 
internal parameters are controllable as those of an aerial
observation  platform  might  be. 

•  Creation of Synthetic Views. Once an area of interest, a
data set, and a camera view have been selected, the system
can create synthetic views to show the operator how the
area might look in real life from the chosen viewpoint.

We remark that soft-copy views constructed using such tools
have  many  potential  advantages  over  hard-copy  cartographic 
products. 

2.3  User  Interface  Components 

The interactive environment consists of the following
components:

•  Graphics Screen. The graphics screen of the workstation
is divided into a number of windows. Each window holds a
stack of views, and each view consists of a camera model,
an optional image corresponding to the camera model, and
a  set  of  cartographic  feature  databases.  Wire  frame  models 
of  the  cartographic  features  are  overlaid  on  the  image,  if 
present,  using  the  associated  camera  model  to  project  from 
world  coordinates  to  window  coordinates. 

The  features  themselves  are  displayable  either  in  terms  of 
wire  frames  or  texture-mapped  renderings.  The  wire  frames 
are  easily  moved  around  in  real  time  and  have  special  sensi¬ 
tive  regions  used  as  interactive  handles  for  the  mouse  func¬ 
tions  described  in  the  next  paragraph.  When  realistic  but 
static  grey-scale  images  of  the  scene  are  desired  instead  of 
real-time  manipulation,  each  feature  has  an  alternative  dis¬ 
play  method  that  renders  texture-mapped  faces  into  a  depth 
buffer  with  anti-aliasing. 

•  Mouse Pointing Device. A variety of operations are accessible
by pressing either the left, middle, or right mouse
buttons while some combination of the four state-modifying
shift keys (Control, Meta, Super, and Hyper) is held down.
Depending on the operation, the movement of the mouse
cursor may serve to change a parameter of the object. When
an operation involves several steps, a sequence of prompts
for mouse and keyboard actions appears on the screen. Two
of the mouse operations, the Create Object function and the
View Menu, are specifically for the creation and manipulation
of three-dimensional scene structures. The remainder
of the operations available directly from the mouse interface,
shown in Table 1, are standard ImagCalc operations
that deal mainly with the world of two-dimensional screens
and digitized image arrays. A superset of the ImagCalc
mouse-driven operations can be invoked from an optional
pull-down menu bar on the main display screen. These operations
are detailed in the ImagCalc manual [Quam, 1985]
and will not be described here.

When the mouse cursor touches a sensitive area of an object's
interactive depiction, the menu of operations available
to the mouse immediately switches to one appropriate to the
object. A typical set of such operations is shown in Table
2 for a simple cube-shaped object.

•  Keyboard.  The  keyboard  is  available  for  entering  text 
information.  Typical  operations  in  the  interactive  mode 
would involve typing in parameters on the menu attached to
a  specific  object  to  modify  its  current  state  or  configuration, 
or  specifying  a  file  name  for  loading  a  data  set.  Simple  Lisp 
operations typed on the keyboard are interpreted in the
Lisp  Listener  window  described  below. 


•  Images.  Digitized  photographs  or  similar  sensor  data  from 
geographic  areas  relevant  to  modeling  and  feature  extrac¬ 
tion  tasks.  Associated  with  the  images  should  be  supple¬ 
mentary  data  such  as  geographic  coordinates,  camera  mod¬ 
els  (or  equivalent  sensor-to-world  transformations),  illumi¬ 
nation  information,  exact  times  that  allow  reconstruction  of 
the  sun  position  for  outdoor  scenes,  sensor  characteristics, 
sensor  response  information,  and  atmospheric  properties  at 
the  time  the  images  were  generated.  We  note  that,  at  the 
current  time,  we  are  aware  of  no  standard  sources  of  data 
that  include  the  necessary  atmospheric  or  sensor  response 
information;  this  lack  severely  handicaps  any  efforts  that 
might  be  made  to  correct  the  photometry  for  atmospheric 
and  film  processing  effects. 

•  Cartographic Products. Existing cartographic products,
either in the form of digital feature data or digitized
images of hard-copy products, can be used in the same way
as images to provide a world-registered context for the entry
or editing of additional cartographic information. Existing
low-resolution  data  can  be  extremely  effective  in  guiding  a 
high-resolution  feature-acquisition  task. 

•  Digital  Terrain  Elevation  Data.  When  available,  ter¬ 
rain  elevation  data  should  be  registered  to  the  corre¬ 
sponding  images.  This  permits  the  effective  use  of  three- 
dimensional  constraints  during  feature  entry,  the  construc¬ 
tion  of  accurate  synthetic  images,  and  the  generation  of 
cartographic  products  containing  elevation  information. 

•  Compiled  Feature  Data.  Cartographic  feature  data¬ 
bases  are  needed  for  the  generation  of  cartographic  prod¬ 
ucts  meeting  the  needs  of  tasks  for  which  an  image  alone 
is  insufficient.  Ideally,  the  system  should  support  the  inter¬ 
pretation  of  precompiled  feature  databases,  the  editing  of 
any  item  in  such  a  database,  and  the  interactive  entry  of 
new  database  features. 

•  Feature  Models.  In  order  to  support  the  entry  of  new 
features,  a  library  of  predefined  prototype  feature  models  is 
required. Such models would be used either by an analyst
generating a distributable feature database product, or by
an end user updating an existing database. Feature models
must  support  interactive  modes  that  allow  them  to  be  easily 
moved  to  correct  geographic  positions  and  adjusted  with 
respect  to  their  internal  parameters. 

2.5  View  Components 

Each view in the stack on a window consists of a fundamental
•  Lisp Listener. The Lisp Listener is a window in which
keyboard input is passed to the Lisp interpreter and executed.
Many mouse operations on windows return the data
structure pointed at to the Lisp Listener. In this way, one
can assign symbols to individual images and cartographic
objects to facilitate their later examination and manipulation.


2.4  Information  Sources 


Among the types of information that the system is designed to
handle  are  the  following: 


set  of  components  tying  together  information  sources  such  as 
those described in Section 2.4. The components of a view are
the  following: 


•  World-to-view coordinate transformation;

•  Cartographic object database;

•  Optional image, which must correspond to the coordinate
transformation if present;

•  Optional terrain elevation model.


The basic operations available on the view include the functions
on the View Menu, summarized in Table 3. To begin
Zoom In
Copy To Here
Kill Top
Y-Slice
Describe
Scroll Window
Mouse Right
Menu
Move To Here
Kill Stack
Histogram
Clear Tandem
Fast Zoom Out
Pop Multiple
Magnify Pane
Inspect Stack
Negate
I'cljt Viewpoint | Push Displayed


Table 1: The view operations available directly from the mouse when no object is
selected. To get access to the functions on each line, the corresponding
state-modifying keys must be held down while the mouse button is depressed. The
symbols are C for Control, M for Meta, S for Super, and H for Hyper.
Database Operations      View Operations

Configure Databases      New View Transform
Add a Database           Render Terrain
Add All Databases        Render Objects
Remove a Database        Refresh Window
Sensitize a Database
Hide a Database
Expose a Database


Table 3: View Menu. The model-oriented operations
performable on a view.


Flat-Surface Objects       Curve-Based Objects
  Box                        Flight Path
  Building                   Open Curve
  House                      Closed Curve
                             Ribbon

Curved-Surface Objects     Mensuration Objects
  Half Cylinder              Crosshair
  Cylinder                   Epipolar Crosshair
  Superellipse               3D Axes
  Superquadric               Ruler
  SuperSketch Object

Group Objects              Utility Objects
  Composite Object           3D Text
  Cartographic Database      DTM Mesh
                             Camera
                             Sun Ray

Table 4: Create-Object Menu. The classes of objects
that may be instantiated.


working  with  a  view,  one  must  either  create  a  bare  transform 
from  the  View  Menu,  or  manually  load  an  image  file  and  set 
up  the  required  items  on  the  resulting  viewpoint  in  the  window. 
Once  a  view  has  been  established,  one  can  perform  such  opera¬ 
tions  as  adding,  editing,  deleting,  and  cloning  objects,  copying  a 
viewpoint  to  a  different  window  without  an  image,  changing  the 
simulated  camera  position,  acquiring  texture  maps  for  object 
faces,  and  generating  simulated  scene  views. 

2.6  Feature  Manipulation 

Features  are  instantiated  either  by  cloning  existing  features 
or  by  selecting  from  a  pop-up  menu  of  feature  types  invoked 
by  Create  Object.  The  object  types  currently  available  on  this 
menu are summarized in Table 4. Each feature contains a set
of  subfeatures  (such  as  polyhedral  vertices,  edges,  and  faces) 
that  may  be  sensitive  to  the  mouse  pointing  device.  Whenever 
the  mouse  is  pointing  sufficiently  close  to  a  subfeature,  it  is 
highlighted to indicate feature selection, and a prompt window
on  the  screen  is  updated  to  indicate  the  name  of  the  feature  and 
the  menu  of  operations  available  using  the  left,  middle,  and  right 
mouse  buttons. 

A feature's geometric parameters can be dynamically modified
using a variety of operations that exploit continuous
mouse-to-screen feedback. For example, one of the routines uses the
mouse to drag an object around on the surface of the terrain
model associated with the view. As the mouse changes the position
of the object in the window, a corresponding ray from the camera
is intersected with the terrain model to determine the world
position of the object. All visible views containing the feature
are continuously updated to reflect changes in the object's
parameters.
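
The ray-terrain intersection behind this dragging operation can be sketched as a march along the ray followed by bisection. The Python below is a hedged illustration, not the system's routine: the function names, the step size, the range limit, and the bilinear interpolation of a gridded terrain model are all assumptions.

```python
def dtm_height(dtm, cell, x, y):
    """Bilinearly interpolated elevation at world (x, y).

    dtm is a grid (at least 2x2) of elevations with spacing `cell`,
    indexed dtm[row][col] with x along columns and y along rows;
    queries outside the grid are clamped to its edge.
    """
    rows, cols = len(dtm), len(dtm[0])
    gx = min(max(x / cell, 0.0), cols - 1)
    gy = min(max(y / cell, 0.0), rows - 1)
    c0, r0 = min(int(gx), cols - 2), min(int(gy), rows - 2)
    fx, fy = gx - c0, gy - r0
    top = dtm[r0][c0] * (1 - fx) + dtm[r0][c0 + 1] * fx
    bottom = dtm[r0 + 1][c0] * (1 - fx) + dtm[r0 + 1][c0 + 1] * fx
    return top * (1 - fy) + bottom * fy


def intersect_terrain(origin, direction, dtm, cell, step=0.5, max_range=1000.0):
    """First point where the ray meets the terrain, or None if it never does."""

    def clearance(t):
        x, y, z = (origin[i] + t * direction[i] for i in range(3))
        return z - dtm_height(dtm, cell, x, y)

    if clearance(0.0) < 0:
        return None  # the ray starts below the surface
    t_prev, t = 0.0, step
    while t <= max_range:
        if clearance(t) <= 0.0:
            lo, hi = t_prev, t  # the crossing lies in this bracket
            for _ in range(40):
                mid = 0.5 * (lo + hi)
                if clearance(mid) > 0:
                    lo = mid
                else:
                    hi = mid
            return tuple(origin[i] + hi * direction[i] for i in range(3))
        t_prev, t = t, t + step
    return None
```

In an interactive setting the returned world point would become the new position of the dragged object, after which every view containing it is redrawn.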

There  are  several  types  of  “utility”  objects  that  provide  spe¬ 
cialized  user-interface  capabilities.  Among  these  are 

•  Camera  model  objects.  These  are  graphical  features 
that  allow  the  user  to  interactively  modify  the  parameters 
of  a  camera. 

•  Sun  ray  object.  The  sun  ray  lets  the  user  control  the 
position  of  the  sun  in  a  view.  The  ray  is  adjusted  and  then 
set to determine the way in which illumination is taken into
account  in  rendering  operations. 

•  DTM Mesh. A mesh representing a small patch of the
digital terrain model (DTM) can be moved interactively
around the scene to help the user visualize the local terrain
characteristics. 

Views  with  new  camera  models  can  be  created  by  copying  ex¬ 
isting  views  and  modifying  their  camera  models.  Initially,  these 
new  views  have  no  associated  aerial  image,  and  only  the  wire 
frame depictions of the feature databases are visible. Images
can  be  generated  for  new  camera  models  by  texture  mapping  an 
existing  image  onto  the  terrain  model  and  rendering  the  feature 
databases. 

3  Programming  Tools 

In  order  to  create  new  scenarios  and  test  the  feasibility  of  new 
concepts,  one  must  be  able  to  manipulate  and  augment  the  in¬ 
ternal  system  data  structures.  Here  we  give  a  brief  summary 
of  some  of  the  structures  and  the  associated  programming  tools 
and  utilities. 

The  features  manipulated  by  the  system  are  all  implemented 
as  instances  of  flavors  in  an  object-oriented  programming  envi¬ 
ronment.  This  means,  in  particular,  that  many  distinct  object 
types can share or inherit fundamental data fields and messages,
which  are  functions  that  act  intelligently  within  the  data  struc¬ 
tures  of  a  particular  object  instance. 

The  fundamental  object  types  used  by  the  Cartographic  Mod¬ 
eling  Environment,  along  with  a  few  of  the  most  critical  messages 
they  handle,  are  summarized  below: 

•  Viewpoint.  The  viewpoint,  also  referred  to  in  this  doc¬ 
ument  as  a  view,  is  a  fundamental  ImagCalc  flavor  that 
holds  the  information  describing  each  displayable  thing  on 
a  window's  stack.  In  ImagCalc,  the  viewpoint  contains  such 
structures  as  digital  images,  graphs,  and  curve  plots.  In 
the  Cartographic  system,  the  Viewpoint  relates  perspective 
transformations  representing  camera  models  to  correspond¬ 
ing  images,  elevation  models,  and  databases  of  cartographic 
objects. 

•  Image. An image is a data structure containing digitized
image data. The image formats supported include binary
images, 8-bit grey scale images, integer and floating point
images, and multiple-band images such as color images.
The image object handles the message :display-image
that chooses the best way to display it on the current window,
dithering when necessary. Image pixels can be accessed
and altered with the :iref and :iset messages.

•  Perspective  Transform.  This  family  of  objects  includes 
both  orthographic  transforms  and  4x4  homogeneous  per¬ 
spective  transform  matrices  that  relate  a  world  coordinate 
system  to  a  particular  pixel  in  the  window.  Film  digiti¬ 
zation  parameters  are  incorporated  when  the  window  pixel 
corresponds  to  an  actual  digital  image.  By  sending  the 
messages :project-to-world and :project-to-view to a
transform instance, one can achieve any desired forward or
inverse  coordinate  transformation. 

•  Camera-Model  Objects.  This  object  type  is  the  inter¬ 
active  handle  by  which  perspective  transform  objects  (i.e., 
camera  models)  can  be  accessed  and  modified  by  the  user. 
It  accepts  a  family  of  messages  supporting  the  modification 
of its three-dimensional position and orientation, as well as
its  focal  length  and  piercing  point. 

•  Digital  Terrain  Elevation  Model.  Elevation  models  are 
arrays  of  world  elevation  values  in  a  designated  local  coor¬ 
dinate  system.  The  model  includes  a  transformation  object 
that  translates  from  the  array  grid  coordinates  to  the  hori¬ 
zontal  world  coordinates. 

•  Two-Dimensional  Objects.  These  are  objects  such  as 
text  that  are  strictly  related  to  the  display  window,  and 
move  only  in  the  two-dimensional  window  space. 

•  Three-Dimensional Objects. This class of objects includes
broad families of objects that are displayable and
movable  in  the  three-dimensional  simulated  world.  They 
accept  the  :draw-on-view  message  to  make  a  wire-frame 
depiction,  and  the  :render-on-view  message  to  generate 
a  texture-mapped  view  of  the  object.  The  two  major  sub¬ 
classes  in  this  family  are  planar-faced  objects,  whose  faces 
are  true  planes,  and  smooth-shaded  objects,  whose  faces  are 
only arbitrarily tessellated representations of a smoothly
interpolatable heuristic or mathematical surface.

•  Three-Dimensional Curves. This family of objects corresponds
to roads, boundaries, and delineations of various
types. In addition to the usual motion and depiction capabilities,
these objects can have individual vertices added,
deleted,  or  edited  independently. 
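
The forward projection performed by a perspective transform object can be illustrated with a small sketch. This is not the system's Flavors code: the Python class, the method name (which only mimics the :project-to-view message), and the simple pinhole matrix convention are all assumptions made for illustration.

```python
# Hypothetical sketch of a 4x4 homogeneous perspective transform.
# Class and method names only mimic the :project-to-view message;
# they are not the system's actual interface.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


class PerspectiveTransform:
    def __init__(self, focal_length):
        f = float(focal_length)
        # Assumed convention: pinhole camera at the origin looking down +z.
        self.matrix = [
            [f, 0.0, 0.0, 0.0],
            [0.0, f, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],  # w = z, so dividing by w gives perspective
        ]

    def project_to_view(self, world_point):
        """Map a world point (x, y, z) to 2-D view coordinates."""
        x, y, z = world_point
        hx, hy, _hz, hw = mat_vec(self.matrix, [x, y, z, 1.0])
        if hw == 0:
            raise ValueError("point lies in the camera plane")
        return (hx / hw, hy / hw)


transform = PerspectiveTransform(focal_length=2.0)
view_xy = transform.project_to_view((1.0, 2.0, 4.0))
```

The inverse direction is not sketched here, because a single view pixel determines only a ray in the world; some additional constraint, such as a known depth or an intersection with the terrain, would be needed to pick a point on it.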

Rendering  facilities  form  an  important  class  of  the  system 
capabilities.  The  available  rendering  operations  include 

•  Buffered  line-drawing.  This  allows  complex  objects  to 
be  moved  with  minimal  flicker  because  the  erasure  step  is 
buffered  to  occur  at  the  same  time  as  the  next  draw. 

•  Z-buffering.  Three-dimensional  scenes  may  have  arbitrar¬ 
ily  complex  configurations  of  objects  intersecting  each  other 
and  the  terrain.  Depth  buffering  is  provided  to  handle  these 
situations  correctly  when  rendering  a  simulated  scene. 

•  A-buffering. A substantial improvement in the A-buffer
approach to anti-aliasing [Carpenter, 1984] has been developed
especially for this system. This facility removes jagged
edges and unrealistic artifacts of the uncorrected rendering
procedure by computing subpixel contributions of all visible
faces affecting the pixel.


•  Terrain  Calc  Plot.  Texture  mapping  of  an  image  onto  a 
digital  terrain  model  is  handled  by  the  Terrain  Calc  system. 
Among  the  unique  techniques  incorporated  into  this  system 
to  make  the  rendering  as  realistic  as  possible  are  the  use  of  a 
multiresolution  image  hierarchy  and  the  dynamic  selection 
of  local  texture  map  resolutions. 
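
The depth-buffering idea listed above can be sketched in a few lines. This is a generic illustration of the technique under assumed data structures (fragments as tuples, images as nested lists), not the system's renderer.

```python
# Minimal depth-buffer (Z-buffer) sketch.
# Each fragment is (x, y, depth, color); smaller depth means nearer the viewer.

def render_with_zbuffer(width, height, fragments, background=0):
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[background] * width for _ in range(height)]
    for x, y, z, c in fragments:
        inside = 0 <= x < width and 0 <= y < height
        if inside and z < depth[y][x]:
            depth[y][x] = z   # the nearer fragment wins the pixel
            color[y][x] = c
    return color


# Three overlapping fragments at pixel (0, 0), drawn in arbitrary order.
image = render_with_zbuffer(2, 1, [(0, 0, 5.0, 7), (0, 0, 2.0, 9), (0, 0, 3.0, 4)])
```

Because each pixel keeps only the nearest fragment seen so far, the result does not depend on the order in which intersecting objects and terrain are drawn.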

4  Overview  of  the  System  Configuration 

The  current  implementation  of  the  Cartographic  Modeling  Envi¬ 
ronment  is  a  research-oriented  system  that  is  intended  primarily 
to  be  a  tool  for  feasibility  studies,  rather  than  a  fully  supported 
software  product. 

The  system  is  implemented  in  a  combination  of  Common  Lisp 
and  Zetalisp  with  Flavors  on  Symbolics  Lisp  Machines  running 
Symbolics  Genera  Release  7  system  software.  Image  data  and 
feature  files  can  be  stored  on  any  file  system  on  the  network 
that  supports  a  CHAOSNET  file  transfer  protocol.  Support  is 
included  for  Symbolics  black-and-white  consoles,  for  the  Symbol¬ 
ics  CAD  frame  buffer  systems,  and  for  the  Symbolics  “Hi-Res” 
system.  The  system  displays  grey-scale  images  on  the  black- 
and-white  consoles  using  dithering,  and  can  display  24-bit  color 
images  on  color  systems  with  only  8  bits  of  memory  using  a  color 
dithering  technique.  Experimental  support  is  also  available  for 
the  Tektronix  SGS-420  stereographic  display  system,  which  is 
used  in  conjunction  with  the  Symbolics  CAD  buffer  to  provide 
double-buffered 512 x 512 stereo displays that refresh at 60 Hz
per eye, 120 Hz overall. The system has not been transferred to
other  hardware  and  software  configurations  at  this  time;  the  fea¬ 
sibility  of  such  a  conversion  effort  is  expected  to  depend  strongly 
upon the availability of high-performance graphics systems, the
development of correspondingly capable window system standards,
and the availability of high-performance object-oriented
extensions to Common Lisp.

This  article  is  an  initial  version  of  the  first  of  three  manuals 
that  are  being  written  to  document  the  system.  The  documen¬ 
tation  planned  at  this  time  consists  of  the  following: 

•  Overview  of  the  SRI  Cartographic  Modeling  Envi¬ 
ronment.  A  brief  overview  of  the  system  and  its  capabili¬ 
ties  [this  document]. 

•  Users’  Guide  to  Interactive  Use  of  the  SRI  Carto¬ 
graphic  Modeling  Environment.  A  description  of  the 
tasks  a  system  user  can  perform  interactively,  using  the 
mouse  and  keyboard  to  issue  commands. 

•  Programmers’  Guide  to  the  SRI  Cartographic  Mod¬ 
eling  Environment.  A  detailed  description  of  the  sys¬ 
tem’s  internal  structure  for  users  needing  to  customize  the 
system  to  their  own  needs  and  build  new  capabilities  within 
the  context  of  the  environment. 

In  addition,  a  series  of  videotapes  is  planned.  An  overview  pre¬ 
sentation  will  illustrate  the  basic  nature  and  capabilities  of  the 
system and will be a companion to the Overview manual; a family
of shorter tapes dealing with specific applications within the
environment will be produced from time to time.

5  Summary 

The implementation of the Cartographic Modeling
Environment has many capabilities that are of interest for evaluating
computer-based approaches to digital cartography. In
addition,  the  system  is  well-suited  for  interacting  with  three- 
dimensional  simulated  environments  and  for  performing  gen¬ 
eral,  computer-aided  three-dimensional  modeling  and  rendering 
tasks.  Future  development  of  the  system’s  capabilities  will  in¬ 
clude  the  following  areas: 

•  Supporting  symbolic  representations  of  scene  relationships 
in  the  style  of  semantic  networks. 

•  Improving  computer-assisted  feature  entry  and  constraint 
exploitation  capabilities. 

•  Extending  scene  simulation  to  additional  types  of  sensors. 

•  Exploring  techniques  for  high-speed  manual  or  semiauto- 
mated  entry  of  high-resolution  feature  data. 

Open problems for research in the context of this system include
such tasks as simulation of scenes at arbitrary times of day,
supporting  irregular  and  topologically  complex  terrain  data,  and 
incorporating  the  notions  of  time  and  physical  constraints  into 
the  scene-generation  facilities  to  support  complex  animation  re¬ 
quirements. 

Figure 1: Block diagram of the major components of the SRI Cartographic
Modeling Environment.

0  ABSTRACT

In  this  paper  we  formalize  a  model  of  topological 
navigation  in  one-dimensional  spaces,  such  as  single  roads, 
corridors,  or  transportation  routes.  We  formally  define  the 
concepts  of  a  direction  and  a  custom  map,  as  well  as  some 
specifications  for  motor  and  sensor  models,  including  the  idea 
of  sensor  synchronicity.  We  discuss  the  representations 
necessary  to  model  and  exploit  differences  between  the  world 
itself  (here,  a  version  of  "Lineland"),  the  world  as  perceived  by 
the  map-maker,  and  the  world  as  experienced  by  the 
navigator.  We  demonstrate  the  difficulty  of  giving  precise 
meaning to what is meant by a "good" map, and what is meant
by a "landmark". We prove that even in the simplest case
where  both  motor  and  sensor  control  is  error-free,  NP- 
complete  problems  arise  in  Lineland,  even  while  attempting  to 
navigate  from  one  single  object  to  another.  Forced  to  use  the 
A*  search  method,  we  provide  heuristics  that  appear 
reasonable  for  map  creation,  and  give  examples  of  several 
maps  that  are  "good"  under  varying  criteria. 


1  INTRODUCTION 

In  this  paper,  we  define  and  explore  in  a  simplified  way 
a  model  for  characterizing  a  certain  class  of  robotic  navigation 
problems, that of navigation-in-the-large within a linear
environment. It is in contrast to much existing work on robotic
navigation,  which  attempts  to  learn  through  direct  experience  a 
metric  map  for  a  small  two-dimensional  space  such  as  a  room 
[9]. Instead, our emphasis is on the specification, efficient
creation,  and  use  of  topological  maps  for  a  large  one¬ 
dimensional  space,  such  as  a  corridor  or  subway  route.  The 
formalization  we  have  chosen  we  believe  will  also  be  useful  for 
other in-the-large problems, whether 1.5-dimensional (such as
highway  networks),  or  variations  on  two-dimensional  (such  as 
building  floors,  forests,  or  oceans).  Since  a  large  body  of 
evidence  suggests  that  humans  use  topological  maps  for 
large-scale navigation [3, 8], the formalization may also prove
useful  to  cognitive  scientists. 

We believe the value of this paper is threefold. First, it
formalizes  several  important  aspects  of  the  map-maker’s 
relationship  to  both  the  world  that  is  abstracted,  and  to  the 
navigator  that  will  follow  the  map.  Second,  it  documents  the 
surprisingly  large  range  of  problems  that  navigation  even 
along  a  line  entails.  Third,  it  shows  how  even  in  the  simplest 
cases heuristics are not only convenient, but are absolutely
necessary,  since  several  of  the  problems  are  provably  NP- 
complete;  some  may  be  harder  still. 

The paper is ultimately motivated by two questions:
what is a "good" map, and what is a "landmark". In attempting
to  answer  the  questions,  we  investigate  classes  of  maps  (and 
specialize  on  the  concept  of  "custom"  map),  and  classes  of 
sensor  and  motor  models.  Although  this  paper  ignores  issues 
of  error,  it  relates  the  concept  of  "good"  map  to  navigational 
costs of various sorts; they are then demonstrated with
several  simple  examples. 


2  OUR  WORLD  VIEW:  THE  WORLD,  THE 
MAP-MAKER,  AND  THE  NAVIGATOR 

Many  navigation  environments  are  essentially  one¬ 
dimensional.  Doors  on  a  corridor;  exits  on  a  highway;  bridges 
or  docks  on  a  river;  the  stops  on  a  train,  bus,  subway,  or  even 
plane route: they can all be abstracted as the (elaborate)
values  of  a  function  defined  on  some  closed  interval  of  the  real 
line.  Because  they  are  static,  these  environments  in  some 
sense  are  even  simpler  than  the  world  of  "Lineland",  an 
imaginary  one-dimensional  world  postulated  by  the  inhabitants 
of the imaginary two-dimensional Flatland, in Abbott's
prescient novel by the same name [1]. Dynamic one-dimensional
worlds do exist, usually when there are multiple
navigators;  we  do  not  address  them. 

We  find  it  critical  to  distinguish  three  similar  but  subtly 
different perceptions of "the world". The one-dimensional world
as  it  exists  and  is  experienced  by  both  map-maker  and 
navigator  is  mathematically  rich:  it  is  continuous,  it  has  a 
distance  measure,  and  objects  "embedded"  in  it  can  have 
finite  extent.  Much  of  this  appears  to  be  extraneous:  many 
such  worlds  consist  largely  of  the  essentially  empty  space 
between  objects,  much  navigation  can  be  done  by  simple 
object  order  rather  than  object  distance,  and  objects  can  often 
be  considered  to  be  point-like. 

We  therefore  postulate  that  the  map-maker,  omniscient 
and  error-free,  has  abstracted  the  world  into  a  sequence:  that 
is,  the  world  is  conceived  as  a  function  over  the  integers. 
Empty  space,  distance,  and  object  extent  are  ignored.  The 
omniscience  of  the  map-maker  ignores  the  very  difficult  issues 
of map induction (see [4]), and the infallibility dodges the
issues of partial or errorful information, deliberate camouflage,
unexpected events, or even the proper navigator strategies for
"believing" a map and for recovering from any errors
(strategies that depend, in part, on the navigator's model of the
map-maker). However, to our map-maker, a plane flight would
be  simply  the  sequence,  <New  York,  London,  Paris,  Rome>. 

The  map-maker  communicates  even  less  of  this 
sequence  to  the  navigator  in  the  form  of  a  "best"  "custom  map" 
of  "landmarks";  this  process  is  the  focus  of  the  paper. 
Throughout we assume that there is only one navigator, and
that  the  map-maker's  omniscience  extends  to  a  perfect  model 
of  the  navigator's  sensory  endowment:  their  number,  range, 
responses,  and  interdependencies.  The  interesting  conflicts 
that  arise  when  the  map-maker  mismodels  the  navigator,  their 
severity,  and  their  recoverability  we  ignore,  but  we  note  that 
such  problems  recapitulate  many  of  the  issues  in  user 
modeling.  Similarly,  we  do  not  explore  how  constraining  it  is 
for  a  map-maker  to  service  a  diverse  clientele  of  widely 
varying  sensory  agents,  the  sensory  endowments  of  which 
may  not  even  overlap. 

The  navigator  perceives  the  world  in  the  most  limited 
way.  Sensory  range  is  limited,  perhaps  only  to  the  current 
object  being  experienced,  and  there  is  virtually  no  memory. 
However,  this  is  an  actual  advantage;  usually  the  less  that 
needs  to  be  sensed  or  remembered  the  better.  Although  more 
intelligent  navigators  can  be  modeled,  this  one  chooses  to 
ignore  most  of  the  world  and  its  features  for  the  sake  of 
efficient  traversal.  Generalizations  to  self-teaching,  self- 
correcting  navigators  (ones  that  would  discover  short-cuts 
while  attempting  to  recover  from  error,  etc.)  can  certainly  be 
modeled  and  explored;  not  here. 

Although  we  do  not  consider  it  here  directly,  it  is  often 
the  case  that  the  map-maker  and  the  navigator  are  the  same 
agent;  however,  for  clarity  we  insist  here  that  they  be 
considered  informationally  separate,  with  the  only 
communication  between  them  being  the  giving  of  the  custom 
map  from  the  map-maker  to  the  navigator.  In  human  terms, 
this  dual  role  often  occurs  when  a  person  reads  a  road  map 
(that  is,  consults  a  representation  of  the  world  abstracted  in 
terms  of  navigator  sensory  capabilities),  summarizes  in  verbal 
terms  an  efficient  projected  route  (that  is,  creates  a  "best" 
"custom  map"  of  "landmarks") ,  and  then  follows  the  road  while 
talking  to  himself  (that  is,  executes  the  custom  map  as 
navigator).  It  is  important  to  note  the  three  levels  at  work  here 
again:  there  is  the  world,  the  abstraction  of  it,  and  the  custom 
map. 

For consistency, we will refer to the world as "Lineland",
the  abstraction  of  it  as  the  "world  model"  or  the  "abstract 
sequence",  and  the  custom  map  as,  simply,  the  "map".  In 
layman's  terms,  the  world  model  is  usually  called,  itself,  a 
"map",  and  the  custom  map  would  probably  be  called  "the  list 
of  directions."  However,  in  some  instances,  particularly  at  car 
rental  agencies  equipped  with  electronic  custom  map-makers, 
the  layman  can  be  given  exactly  what  we  call  a  map,  if  he  has 
a  unique  destination  in  mind.  Surprisingly,  recent  research 
shows  that  maps  as  we  define  them  are  preferred  by  human 
beings  over  the  Exxon-variety  world  models  [11]. 


3  FORMALIZATION  OF  THE  NAVIGATOR 

We  now  specify  in  some  detail  the  definitions  and 
representations  that  are  necessary  to  investigate  the  difficulty 
of  navigating  in  such  a  domain.  In  this  paper  we  do  not 
address  most  issues  of  error  or  uncertainty:  dynamism  in 
Lineland, partial map-maker information, sensory error, and
motor inaccuracies are all ignored. We simply note that there
are many ways to model each, and that the addition of some
uncertainties to the problem may even raise its level of
difficulty in the complexity hierarchy [5].


3.1  THE  ABSTRACTION  OF  LINELAND 

The representation of Lineland as a sequence depends
critically  on  several  epistemological  assumptions  that  the  map- 
maker  makes  with  regard  to  the  (single)  navigator's  sensory 
abilities  and  the  map-maker’s  model  of  them. 

To consider objects as point-like requires enough
processing  intelligence  in  the  navigator  to  recognize  an  object 
as  a  single  object,  and  to  capture,  hold,  and  dismiss  the  single 
object  from  its  sensory  array;  in  effect,  the  navigator  can 
"debounce".  This  is  an  important  consideration,  since  the 
"experience"  of  an  object  must  be  a  unified  whole,  even  if  the 
prior  and  succeeding  experiences  are  identical.  Three 
identical  doors  on  a  corridor  must  give  rise  to  the  experience 
of "door" exactly three times, even though the doors are
sensed  at  some  distance,  have  spatial  extent,  and  are  still 
sensible  after  some  distance,  often  even  when  the  next  door  is 
also  sensible.  In  particular,  the  experiences  of  the  door's 
height,  color,  shape,  size,  and  other  features  must  occur 
exactly  in  synchrony,  all  switching  the  navigator’s  appropriate 
sensors  on  and  off  at  the  same  "time",  although  the  navigator 
may  selectively  "attend"  to  any  subset  of  the  full  sensory  array. 
It  is  not  clear  what  sort  of  world  model  results  if  these 
constraints  are  relaxed. 


3.2  THE  MODEL  OF  THE  NAVIGATOR’S  SENSORS 

The  map-maker  must  model  the  extent  to  which  the 
navigator can sense remotely; this may be sensor dependent.
For  example,  the  sensor  might  return  distant  information  with 
less  reliability,  or  the  sensor  might  be  distance-  or  instance- 
limited.  These  interesting  problems  are  simplified  here  by  the 
assumption  that  none  of  the  navigator's  sensors  are  far¬ 
sighted;  they  can  verify  the  imminent  presence  of  an  object 
only.  For  the  sake  of  argument,  the  navigator  can  be 
considered  to  sense  by  touch  only,  although  there  may  be 
many  attributes  (many  sensors)  to  describe  the  sensations. 
Alternatively,  the  navigator  can  be  considered  to  be  near¬ 
sighted,  or  operating  in  a  fog. 

Given  the  model  of  unified,  immediate  perception, 
Lineland  can  now  be  modeled  by  the  map-maker  as  being  a 
vector-valued  sequence;  each  object  in  Lineland  is  reduced  to 
exactly  one  vector  of  sensations  as  experienced  by  the 
navigator,  one  component  in  the  vector  for  each  sensor.  We 
place  no  restrictions  on  modalities  of  the  sensors,  and  they 
may  be  any  measurable  quality  such  as  color,  distance,  area, 
texture,  shape,  number  of  holes,  etc.:  an  object  can  be 
perceived  as  "green,  far,  large,  rough,  square,  two".  However, 
we  require  that  each  such  sensor  indicate  stimulation  by  an 
object  by  presenting  to  the  navigator  a  symbol  from  a  discrete 
set;  we  bypass  the  problems  of  continuous  values,  multi¬ 
dimensional  or  structured  values  ("icons"),  and  all  their 
extensions  to  probabilistic  sensations  (for  example,  "mainly 
blue",  "possibly  a  face",  etc.). 

Thus, the map-maker's view of Lineland is really the
navigator's, except omniscient and error-free. Although the
map-maker may have at its disposal many ways of perceiving
Lineland  as  an  occasional  navigator,  too,  only  those  that  the 
specific  navigator  can  respond  to  are  considered  in  map¬ 
making.  Thus  the  map-maker’s  view  of  Lineland  for  purposes 
of  map-making  may  vary  greatly  from  any  "personal" 
experience  of  it.  Human  examples  include  the  invisibility  of  IR, 
UV,  SAR,  and  other  non-natural  sensations,  which 
nevertheless  can  be  used  in  directing  a  non-natural  navigator. 

The navigator is modeled as having exactly $S$ sensors,
and each sensor $s_k$ takes on its values from its associated
discrete domain $d_k$. For example, $d_1$ might be {red, green,
blue}. Thus, formally, Lineland is modeled by the map-maker
as $\langle \mathrm{object}_1, \mathrm{object}_2, \ldots, \mathrm{object}_L \rangle$, where each object is an
element of the cartesian product of the sensor domains. Thus,
<<big, red>, <little, blue>, <big, blue>> is a trivial Lineland
model, as is <<New York>, <London>, <Paris>, <Rome>>.
The Lineland model sequence inherits a natural ordering;
sometimes we will refer to the predecessor and successor of
an element as "left" and "right", respectively.
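
The formal model just given can be written down directly as a data structure. The following is a minimal sketch, assuming Python tuples for objects and a dictionary of sensor domains; the sensor names and helper functions are illustrative and are not taken from the paper.

```python
from itertools import product

# Hypothetical sensor domains; names and values are illustrative only.
SENSOR_DOMAINS = {
    "size": ("big", "little"),
    "color": ("red", "green", "blue"),
}

def is_valid_object(obj, domains=SENSOR_DOMAINS):
    """An object is one symbol per sensor: a member of the cartesian product."""
    return len(obj) == len(domains) and all(
        value in domain for value, domain in zip(obj, domains.values())
    )

# The trivial Lineland model from the text, in its natural left-to-right order.
lineland = [("big", "red"), ("little", "blue"), ("big", "blue")]

def left_of(model, i):
    """Predecessor of element i, or None at the left end."""
    return model[i - 1] if i > 0 else None

def right_of(model, i):
    """Successor of element i, or None at the right end."""
    return model[i + 1] if i + 1 < len(model) else None

# Every object the navigator could possibly experience with these sensors.
all_possible_objects = set(product(*SENSOR_DOMAINS.values()))
```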

Although  we  will  never  use  it,  we  can  also  define  the 
model of intervening empty space as an S-vector of "nulls",
indicating  nothing  can  be  perceived  on  any  sensor  other  than 
its  normal  background.  The  sensation  vector  is  therefore 
either  all  nulls  or  all  non-nulls,  and  no  partially  perceived 
objects  are  permitted  here,  although  a  "partially  null  sensation" 
may  well  be  taken  as  a  working  definition  of  camouflage. 

Although  the  map-maker's  experience  of  Lineland  is 
error-free,  the  navigator's  may  not  be.  There  are  many  ways 
to specify sensor error, with one of the simplest being a
confusion matrix. For each sensor, the matrix indicates the
probability  that  a  given  value  of  its  domain  will  be  perceived 
mistakenly  as  another  value.  Usually  the  matrix  can  be 
extended  to  include  nulls,  thus  recording  the  probability  of 
falsely  positive  and  falsely  negative  existences  of  objects. 
Given  a  navigator  that  has  a  large  sensory  endowment,  some 
representation  must  be  made  of  sensor-sensor  error 
interaction,  which  itself  may  be  probabilistic.  For  example,  if 
total  agreement  among  all  sensors  is  necessary  for  the 
establishment  of  object  existence,  then  as  the  number  of 
sensors increases the number of falsely negative objects
increases, and the number of falsely positive objects decreases.
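
The effect of requiring total agreement can be checked numerically. The sketch below adds two assumptions that the text does not make: the sensors err independently, and they all share the same hit rate and false-alarm rate. The function name is hypothetical.

```python
def unanimous_detection_rates(hit_rate, false_alarm_rate, num_sensors):
    """Error rates when an object 'exists' only if every sensor reports it.

    Assumes independent, identical sensors (an illustrative simplification).
    Returns (false_negative_rate, false_positive_rate).
    """
    false_negative = 1.0 - hit_rate ** num_sensors      # at least one sensor misses
    false_positive = false_alarm_rate ** num_sensors    # every sensor false-alarms
    return false_negative, false_positive


one_sensor = unanimous_detection_rates(0.9, 0.1, 1)
three_sensors = unanimous_detection_rates(0.9, 0.1, 3)
```

Under these assumptions, moving from one sensor to three raises the false-negative rate and lowers the false-positive rate, which matches the trend described above.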

More  irritating  is  the  non-zero  probability  of  the  navigator 
navigating  wildly:  circling  hopelessly  under  confused  sensory 
perceptions,  or  charging  off  beyond  any  objects  and  towards 
either  "end  of  the  world".  We  note  in  passing  that  humans  are 
not  immune  to  this  behavior.  Special  considerations  then  have 
to  be  made  about  error  recovery,  and  even  of  the  semantics  of 
the "end of Lineland". For these and other reasons we
assume  here  that  the  navigator’s  sensors  are  perfect. 


3.3  THE  MODEL  OF  SENSOR  SYNCHRONICITY 

The  map-maker  must  also  model  the  dependencies  of 
the  various  sensors  on  each  other,  since  the  creation  of  a  map 
implicitly  will  attempt  to  exploit  those  sensors  that  are  most 
available  and  least  costly.  In  fact,  map-making  can  be  defined 
as  the  selection  of  the  proper  subset  and  sequencing  of 
sensors  so  as  to  minimize  the  navigator's  cost,  measured  in 
various  ways,  of  proceeding  through  and  attending  to  the 
world  while  attaining  a  goal  location. 

Since  the  definition  of  a  sensor  is  broad,  often  it  does 
not  correspond  to  a  particular  piece  of  hardware.  For 
example,  area  and  perimeter  measures  are  considered  to  be 
separate  sensors,  although  typically  they  can  be  calculated 
simultaneously  from  common  information.  More  to  the  point, 
they  can  be  assumed  to  be  available  for  parallel  sensing,  with 
a  cost  (if  measured  in  time)  equal  to  the  maximum  of  their 
individual  costs;  further,  they  may  imply  a  further  time  cost  of 
any  prerequisite  "low-level"  sensing.  Thus,  whether  sensors 
are  "virtual"  (as  above)  or  real,  they  may  interrelate  in  complex 
ways,  having  various  data  dependencies;  these  dependencies 
in turn interact with the navigator's ability to simultaneously
"attend"  to  appropriate  subsets  of  them  in  parallel. 

We  bypass  the  difficult  question  of  finding  a  formalism 
for  modeling  these  synchronicity  relationships  in  general. 
Probably  some  directed  graph  structure  will  suffice  for  data 
dependency (i.e., (a,b) if sensor a is a prerequisite for sensor
b), and probably a separate undirected graph structure will
suffice for parallelizability (i.e., (x,y) if sensor x can be run
co-parallel with sensor y); the nodes themselves can be labelled
with their time or other costs. We simplify our task here by
assuming that all sensors are data-independent (they do not
depend on the results of any other sensor as a prerequisite) and
execute in unit time cost and unit "sensor" cost (roughly, a
measure of perceptive difficulty). This leaves us to explore the
parallelization relations.
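
The two graph structures suggested above can be sketched as plain edge sets. This is only one possible encoding; the sensor names, the particular edges, and the helper functions are hypothetical.

```python
from itertools import combinations

# Hypothetical sensors and relations, for illustration only.
# Directed edge (a, b): sensor a is a prerequisite for sensor b.
DEPENDS = {("edges", "area"), ("edges", "perimeter")}
# Undirected edge {x, y}: sensors x and y can be run co-parallel.
PARALLEL = {frozenset(p) for p in [("area", "perimeter"), ("color", "area")]}
# Node labels: time cost of each sensor (unit costs, as assumed in the text).
TIME_COST = {"edges": 1, "area": 1, "perimeter": 1, "color": 1}

def prerequisites(sensor):
    """Sensors whose results this sensor depends on."""
    return {a for (a, b) in DEPENDS if b == sensor}

def can_run_together(sensors):
    """True when every pair in the subset is joined in the parallelization graph."""
    return all(frozenset(pair) in PARALLEL for pair in combinations(sensors, 2))

def parallel_time_cost(sensors):
    """Time for one parallel stage: the maximum of the individual costs."""
    return max(TIME_COST[s] for s in sensors)
```

With the simplifying assumptions adopted in the text, the dependency set would be empty and every time cost would be one, leaving only the parallelization edges to vary.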

The  map-maker  has  the  difficult  chores  of  properly 
selecting  a  sensor  subset  and  exploiting  any  sensor 
parallelism. For example, depending on the surrounding
objects with which it appears, the green-far-large-rough-
square-two object above might be most efficiently sensed
simply by 1) "green", if all the other objects nearby are red, by
2) "green-large", if all the other objects nearby are red or small
or both, and the color and area sensors are parallelizable, by
3) "green then large", if all the other objects nearby are red or
small or both but the color and area sensors are not parallelizable,
and the map-maker has noticed that more of the objects are
red than small, thus making the color sensor less likely to
require the area sensor as backup, or by 4) "large-rough then
green", if the situation is as in 3), except that objects tend to
be small and smooth, and size and texture are parallelizable
with each other, but one or the other or both are not
parallelizable with color. Many others are possible, but even
counting  their  number  is  highly  complex,  since  it  involves 
sequences,  subsets,  and  partitions  (all  of  exponential 
complexity  or  greater),  under  arbitrary  side  conditions.  In 
summary,  part  of  the  map-maker's  job  is  to  find  the  most 
efficient  sequence  of  disjoint  parallel  sensor  subsets  that  the 
environment  and  the  parallelization  graph  permit. 
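
The trade-offs among such descriptions can be made concrete by costing a candidate plan against the nearby objects. The sketch below assumes unit time and sensor costs, represents a plan as a list of stages (each stage a tuple of sensors run in parallel), and assumes that sensing an object stops at the first stage that rules it out. The function names and the example objects are illustrative, not from the paper.

```python
def sense_object(plan, target, obj):
    """Check one object against the target description, stage by stage.

    Returns (matched, time_cost, sensor_cost) under unit costs.
    """
    time_cost = sensor_cost = 0
    for stage in plan:
        time_cost += 1                 # one time unit per parallel stage
        sensor_cost += len(stage)      # one sensor unit per sensor used
        if any(obj[s] != target[s] for s in stage):
            return False, time_cost, sensor_cost
    return True, time_cost, sensor_cost

def total_cost(plan, target, objects):
    """Sum of (time, sensor) costs over all objects encountered."""
    results = [sense_object(plan, target, o) for o in objects]
    return sum(t for _, t, _ in results), sum(c for _, _, c in results)


target = {"color": "green", "size": "large"}
nearby = [
    {"color": "red", "size": "large"},
    {"color": "red", "size": "large"},
    {"color": "green", "size": "small"},
    target,
]
color_then_size = [("color",), ("size",)]
size_then_color = [("size",), ("color",)]
color_and_size = [("color", "size")]   # valid only if the two sensors are parallelizable
```

In this example more of the nearby objects are red than small, so testing color first rejects most objects after a single sensing, as in case 3) above. The parallel plan takes less time but uses more sensor effort.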

(An  aside  concerning  linguistics.  Sensor  sequencing  is 
roughly  analogous  to  the  ordering  of  adjectives  in  English.  For 
example,  the  meaning  of  "a  little  brown  house"  is  somewhat 
different  from  that  of  "a  brown  little  house",  and  implicitly 
communicates  some  of  the  nature  of  the  environment. 
Stretching  the  point,  one  occasionally  sees  written  "a  brown, 
little-but-sturdy  house",  and  the  like.  Further,  the  analogy  also 
carries  over  in  the  sense  that  there  are  "data  dependencies" 
governing  the  sequence  of  adverbs,  adjectives,  and  nouns  in 
noun  phrases:  the  article  always  comes  first,  number  comes 
next,  adverbs  are  immediately  before  their  adjectives,  etc., 
"the  three  very  brown,  little-but-sturdy  houses"  has  only  one 
other  allowable  sequencing  [6].) 

Rather  than  allow  the  navigator  to  have  an  arbitrary 
parallelization  graph,  or  even  one  that  is  partitionable  into 
components  consisting  of  mutually  parallelizable  sensor 
cliques,  in  this  paper  we  investigate  two  common  extremes. 

The  first  synchronicity  model  we  call  the  purely  parallel 
sensor  model  ("parallel  sensors",  for  short),  which  corresponds 
to  the  case  of  the  parallelization  graph  being  complete.  Thus, 
all  sensors  are  independent,  have  unit  costs,  and  are  fully  co¬ 
parallel.  There  is  no  sequencing;  or  rather,  the  map-maker 
needs  only  to  select  the  proper  subset  of  sensors  to  be  run  in 
parallel  in  the  first  and  only  subset  of  descriptions.  In  short, 
instead  of  a  sequence  of  disjoint  parallel  sensor  subsets,  there 
is  a  single  subset.  In  this  case,  the  time  cost  of  any  sensing  is 
a  unit,  although  the  sensor  cost  is  equal  to  the  cardinality  of 
the  subset.  It  is  to  be  noted  that  the  parallel  sensor  case 
always  requires  data  independence  of  sensors,  whether  or  not 
we  have  assumed  it. 
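
For parallel sensors, the map-maker's selection task reduces to finding a small subset of sensors that separates the target from every other object. A minimal sketch follows, using exhaustive search over subsets in order of increasing size. The search is exponential in the number of sensors in the worst case and is shown only to illustrate the definition; the function name and example data are assumptions.

```python
from itertools import combinations

def smallest_distinguishing_subset(target, others, sensors):
    """Return a smallest sensor subset on which every other object differs
    from the target in at least one sensed value, or None if none exists."""
    for size in range(1, len(sensors) + 1):
        for subset in combinations(sensors, size):
            if all(any(o[s] != target[s] for s in subset) for o in others):
                return subset
    return None  # some other object matches the target on every sensor


sensors = ("color", "size", "texture")
target = {"color": "green", "size": "large", "texture": "rough"}
others = [
    {"color": "red", "size": "large", "texture": "rough"},
    {"color": "green", "size": "small", "texture": "rough"},
]
subset = smallest_distinguishing_subset(target, others, sensors)
```

Here the time cost of sensing with the chosen subset is one unit, while the sensor cost is the size of the subset, in line with the cost definitions above.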

The  second  synchronicity  model  is  the  opposite,  and  we 
call  it  the  purely  sequential  sensor  model  ("sequential  sensor", 
for  short),  which  corresponds  to  the  degenerate  case  of  an 
empty  parallelization  graph.  Thus,  all  sensors  are 
independent,  have  unit  costs,  and  are  mutually  antagonistic 
with  regard  to  parallelism.  There  are  no  subsets;  or  rather,  the 
map-maker  considers  each  sensor  to  be  a  singleton  set,  and 
needs  only  to  select  the  proper  sequence.  In  short,  instead  of 
a  sequence  of  disjoint  parallel  sensor  subsets,  there  is  a 
simple sequence. In this case, the time cost of any sensing is
at least one, and less than or equal to the length of the
sequence, and the sensor cost equals the time cost. It is to be
noted that in the sequential sensor case, the map-maker
benefits greatly from any additional data dependencies of
sensors, since they impose additional constraints on the
allowable  sensor  sequences;  as  stated  before,  we  do  not 
assume  any. 


3.4  THE  MODEL  OF  MOTOR  ABILITIES 

In  general,  the  map-maker  needs  a  model  of  the 
accuracy  of  the  navigator’s  abilities,  especially  if  the  custom 
map will make use of concepts dependent on distance. For
example,  a  motor  model  can  be  a  mapping  that  has  as  its 
domain  a  position,  a  heading,  and  a  desired  distance,  and 
gives  as  its  range  a  probability  distribution  of  resulting 
positions  and  headings.  Clearly  such  a  model  can  be 
simplified  in  many  ways;  for  example,  the  motor  performance 
is often independent of initial position; in Lineland a navigator
has only the choice of two headings, "right" or "left", or "+" or
"-"; and, under the map-maker's world model of Lineland as a
sequence,  all  positions  are  integers. 

However,  in  this  paper,  we  will  disallow  any  direct 
references  to  distances  (we  are  interested  in  topological 
navigation),  so  the  motor  model  becomes  trivial  and  error-free. 
The  map-maker  assumes  that  the  navigator  can  tell  left  from 
right  with  perfect  accuracy,  and  proceeds  from  one  object  to  its 
successor  or  predecessor  in  the  sequence  with  perfect 
accuracy,  too.  As  a  side  issue,  we  note  that  this  eliminates 
any  need  for  the  "reassurance"  directives  that  are  common  in 
human  map-making,  and  which  serve  to  recalibrate  the 
navigator; e.g. "if you see an X you have gone too far, turn
around and look for the Y". Further, this eliminates any need
for "universal" directives, which are less often found but make
the navigator self-correcting; e.g. "go until you hit the river, and
follow it downstream to the Z".

A  second  component  of  the  motor  model  is  one  that  is 
useful  in  defining  a  measure  of  motor  work;  all  things  being 
equal  we  would  prefer  a  custom  map  that  was  less  costly  in 
terms  of  motor  effort.  Many  things  can  be  accumulated 
towards  motor  cost;  for  example,  instead  of  minimizing 
distance  traveled  in  Lineland,  we  may  choose  to  minimize 
acceleration  or  the  number  of  reversals  of  direction.  We 
simplify  our  task  by  assuming  that  not  only  is  the  motor 
perfectly  accurate,  it  is  perfectly  efficient  and  accumulates  no 
cost. 

Thus,  in  exploring  the  topology  of  navigation,  we  ignore 
even  the  metric  information  that  accumulates  as  total  distance 
traveled.  Our  concern  is  the  complexity  of  the  custom  map 
and the sensory sequencing, not distance traveled. Thus, as with
some humans, money is no object, but minimizing cognitive
strain  is.  Our  custom  maps  will  emphasize  compactness  and 
landmarks,  rather  than  motor  energy:  "short",  "easy",  but  not 
"fast".  (In  this  regard,  they  are  like  the  scripts  of  natural 
language  processing,  which  are  mostly  concerned  with  the 
temporal  topology  of  significant  events;  scripts  abstract  away 
all  metric  time  information.)  In  any  case,  we  note  that  in 
Lineland  the  most  direct  route  is  always,  trivially,  a  straight  line 
from  start  to  goal. 


4  FORMALIZATION  OF  THE  CUSTOM  MAP 

We  now  specify  the  grammar  of  the  custom  map,  and 
the  assumptions  made  between  the  map-maker  and  the 
navigator  in  its  creation.  A  custom  map  is  simply  a  sequence 
of  directions  for  the  navigator  to  follow,  and  is  devoid  of  any 
spatial  representations.  We  do  not  formalize  or  consider  the 
use  of  graphic  information  in  communicating  a  custom  map; 
for  example,  we  do  not  address  the  equivalent  of  coloring  in 
the  route  on  an  existing  road  map,  or  augmenting  a  custom 
map  with  a  sketch  commonly  "not  to  scale". 


4.1  PARTIAL  OBJECT  DESCRIPTIONS 

At  the  heart  of  the  map-making  process  is  the  selection 
by  the  map-maker  of  certain  objects  to  be  sought  by  the 
navigator,  and  the  sensors  that  are  necessary  to 
unambiguously  identify  them.  Given  a  sensor  synchronicity 
model,  each  of  these  objects  will  be  described  by  a  sequence 
of  disjoint  parallel  sensor  subsets,  and  the  values  that  each  of 
these  sensors  will  observe  when  the  object  is  obtained. 
Assuming i represents the index of the ith sensor
and v represents the value to be observed, a Partial Object
Description (POD) is defined to be a sequence of sets of
ordered pairs, <{(i_1,v_1), (i_2,v_2), ..., (i_k,v_k)},
{(i_{k+1},v_{k+1}), ..., (i_U,v_U)}, ...>.

Since  we  will  investigate  only  two  special  cases  of 
synchronicity,  the  notation  can  be  simplified  for  our  purposes. 
In the parallel sensor case, a POD is considered to be a set of
the form {(i_1,v_1), (i_2,v_2), ..., (i_U,v_U)}, and in the
sequential sensor case, a POD is considered to be a sequence of
the form <(i_1,v_1), (i_2,v_2), ..., (i_U,v_U)>; in both cases we
have dropped a different layer of structuring, so the forms are
nearly identical. Further notational simplicity is achieved when
the (discrete) sensory domains are composed of disjoint symbol
sets. In that case, the sensor index is implicit and the form of
the POD is either {v_1, v_2, ..., v_U} or <v_1, v_2, ..., v_U>,
respectively.  For  example,  in  English  the  adjectives  describing 
color,  area,  and  texture  are  all  disjoint,  perhaps  for  this  very 
reason:  it  is  clear  that  (rough,  red,  big)  implicitly  refers  to 
texture,  color,  and  area,  respectively,  whereas  (super,  great, 
fantastic)  conveys  only  a  sense  of  extremity. 


4.2  UNIT  DIRECTIONS 

We  define  a  custom  map  unit  direction  (or  simply,  a 
direction),  to  be  the  shortest  command  that  can  effect  the 
navigator’s  movement.  Under  our  assumptions,  it  is 
comprised  of  two  components:  a  heading  (“-"  or  "+"),  and  a 
partial  object  description  (POD)  that  describes  the  object  to  be 
sought. 

(Strictly  speaking  the  heading  may  be  redundant.  In 
Lineland,  there  is  never  any  advantage  for  a  navigator  to 
proceed  away  from  the  goal,  although  it  is  possible  that  he 
may  overshoot  it  arbitrarily  often.  If  the  navigator  can  keep 
accurate  positional  information,  such  as  via  the  "I  am  here" 
variable  discussed  below,  the  proper  heading  towards  the  goal 
can  always  be  inferred.) 

Thus,  in  the  parallel  sensor  case,  an  example  would  be 
+{big, red, house}; it means that the navigator should proceed
in  ascending  sequence  order  until  the  size,  color,  and  structure 
sensors  simultaneously  obtain  the  mentioned  values.  In  the 
sequential  case,  it  is  nearly  identical:  -<blue,  small>  instructs 
the  navigator  to  travel  to  the  left  until  a  blue  object  is  detected, 
and  if  it  is  also  then  found  to  be  small,  the  desired  object  has 
been  attained. 

Clearly,  more  sophisticated  unit  directions  are  possible. 
Instead  of  a  POD,  the  map-maker  can  specify  a  distance  or 
other  metric  quantity,  such  as  time,  energy,  or  object  count. 
PODs  can  be  replaced  with  "compound  objects",  defined 
under  standard  regular  grammars.  That  is,  a  compound  object 
(CO) can be defined as CO ::= not CO | CO-CO | CO/CO |
CO^n | POD, where the semantics are "anything other than this
description" ("not a red house"), "the spatial concatenation of
two descriptions" ("a red house followed by a big tree"), "the
spatial alternation of two descriptions" ("a red house or a big
tree"), "n occurrences of the description" ("seven red houses
in a row"), or a simple POD. As with all grammars, nesting is
possible, generating exotic compound objects such as "three
occurrences of big houses next to things other than a big tree
or a small pole", which have one noticeable advantage: they
are very unlikely to be perceived by mistake.

Since  grammars  place  some  constraints  on  the 
processing  power  of  the  navigator’s  equivalent  of  a  short  term 
memory  stack,  we  ignore  them  here.  We  note  however  that 
formal  language  theory,  and  in  particular  the  study  of  LR(k) 
parsers,  addresses  some  of  these  issues,  since  basically  a 
direction  based  on  a  compound  object  is  a  simple  program. 


4.3  CUSTOM  MAPS 

A custom map is now defined as <start, direction^D,
goal>, where start and goal are Lineland sequence indices
(navigator start and goal positions, respectively), direction^D
is a sequence of D unit directions, and D is the
"length" of the custom map. There are some subtle issues
involved,  however. 

The start represents the starting position of the
navigator  to  the  omniscient  and  error-free  map-maker.  For 
now,  we  ignore  the  problem  of  how  such  a  location  is 
determined,  by  whom,  and  in  what  manner.  For  example,  the 
map-maker  may  be  able  to  construct  a  custom  map  for  the 
navigator,  the  sole  purpose  of  which  is  to  take  the  navigator 
from  any  unknown  position  (and/or  orientation)  in  Lineland  to  a 
single  known  location  (a  "landmark");  this  is  sometimes  called 
the  “parachutist's  problem".  Or,  the  navigator  can  motor  in  a 
given  direction  reporting  to  the  map-maker  what  is  (errorlessly) 
observed  until  the  map-maker  knows  the  navigator's  position 
(and/or  orientation)  unambiguously;  this  is  called  the  "shortest 
containing  substring"  problem  and  a  solution  can  be  found  in 
[2]. Or, the two problems can be combined, with the navigator
alternately seeking and reporting. Undoubtedly other means
exist,  including  keeping  a  running  "where  am  I"  data  structure, 
or  asking  other  navigators  (who  may  be  lost  or  deceitful,  etc.), 
"Where  am  I?". 

Likewise,  we  are  ignoring  the  epistemological  problems 
of  defining  the  goal.  There  is  little  problem  if  the  goal  is 
imposed  by  the  map-maker;  what  is  more  difficult  is  how  a 
navigator  would  communicate  a  goal  to  the  map-maker.  If  it  is 
by  index,  the  question  arises  as  to  where  the  integer  came 
from  ("other  knowledge",  or  curiosity,  perhaps).  If  it  is  by 
description,  the  description  would  usually  have  to  be 
unambiguous  (unless  the  navigator  is  seeking  an  object  class), 
implying  that  the  navigator  has  some  model  of  the  navigation 
process  and  even  some  knowledge  of  Lineland  in  general. 
There is a strange twist at work here. If the navigator is
primitive, his descriptions to the map-maker of where he wants
to go are primitive, too, and probably underdetermined ("it's
red"); as with the analogous human situation, progress is
uncertain just when it is most needed. However, if the
navigator  is  savvy  enough  to  uniquely  specify  the  goal,  he  may 
not  need  the  map-maker  at  all  ("three  trees  east  of  the  big  blue 
circular  house"). 


5  NAVIGATING  WITH  A  CUSTOM  MAP 

Having  defined  what  a  custom  map  is,  we  now  examine 
its  use.  Then  we  will  address  its  creation,  which  is  one  of  the 
central  concerns  of  this  paper. 


5.1  MATCHING 

The  most  primitive  decision  a  navigator  must  make  (and 
makes  repeatedly)  is  whether  or  not  its  observation  of  the 
world  is  compatible  with  some  partial  object  description  given 
by the map-maker. We model such an operation by a
boolean-valued function "match", which takes as input a POD,
and  uses  the  POD  to  schedule  the  appropriate  sensory 
information  to  make  the  judgment. 

In  the  parallel  sensor  case,  all  sensors  must  match  the 
values  in  the  POD  for  the  value  of  match  to  be  true;  otherwise 
match  returns  false.  In  either  case,  match  also  returns  a  time 
cost  of  one  unit  and  a  sensor  cost,  N,  equal  to  the  cardinality, 
U,  of  the  POD  subset.  Since  in  this  paper  we  are  ignoring 
metric  properties  of  time  and  space,  we  will  retain  only  the 
sensor  cost  for  our  future  discussions  about  map  efficiency. 

In  the  sequential  sensor  case,  all  sensors  must  again 
match  the  values  in  the  POD  for  the  value  of  "match"  to  be 
true;  but  as  soon  as  "match"  detects  a  mismatching  sensor 
reading  it  returns  with  false  (that  is,  it  behaves  like  the 
conditional-and  construct  in  programming  languages). 
"Match"  also  returns  a  time  and  sensor  cost,  N,  which  are  both 
equal  to  the  number  of  attempted  matches;  this  is  at  least  one 
and  no  more  than  the  cardinality,  U,  of  the  POD  subsequence. 
Again,  we  retain  only  the  sensor  cost  in  future  discussions  of 
efficiency. 
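
The two variants of "match" can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `match_parallel` and `match_sequential`, and the representation of an observed object as a dictionary from sensor index to value, are assumptions made for the example.

```python
def match_parallel(pod, observed):
    """Parallel model: fire every sensor named in the POD at once.

    pod      -- dict {sensor_index: expected_value} (a set of ordered pairs)
    observed -- dict {sensor_index: value} for the object at hand
    Returns (matched, sensor_cost); the cost is always len(pod).
    """
    matched = all(observed[i] == v for i, v in pod.items())
    return matched, len(pod)


def match_sequential(pod, observed):
    """Sequential model: fire sensors one at a time and stop at the
    first mismatch, like a conditional-and.

    pod -- list of (sensor_index, expected_value) pairs in firing order
    Returns (matched, sensor_cost); the cost is the number of sensors fired.
    """
    cost = 0
    for i, v in pod:
        cost += 1
        if observed[i] != v:
            return False, cost
    return True, cost


# Hypothetical object sensed as size=big, color=red, structure=house.
house = {1: "big", 2: "red", 3: "house"}
print(match_parallel({1: "big", 2: "blue"}, house))        # (False, 2)
print(match_sequential([(2, "blue"), (1, "big")], house))  # (False, 1)
print(match_sequential([(2, "red"), (1, "big")], house))   # (True, 2)
```

The parallel variant always pays for the whole subset, while the sequential variant pays only for the sensors fired before the first mismatch, which is why sensor order matters only in the sequential case.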

5.2  TRAVERSAL 

We  now  define  the  semantics  of  a  successful  custom 
map traversal. The navigator receives a custom map, (start,
direction^D, goal), and (re)sets his internal "I am here" integer
to the value of start. He follows the D directions, updating the
"I  am  here"  value  every  time  an  object  is  encountered,  and 
taking  a  new  direction  every  time  "match"  returns  true.  At  the 
final  step,  he  compares  "I  am  here"  to  goal;  the  success  of  the 
traversal  is  defined  to  be  the  value  of  this  equality.  This  last 
step  is  critical.  It  is  insufficient  to  merely  follow  the  directions; 
they  must  have  the  desired  effect.  Since  the  map-maker  is 
omniscient  and  benign,  and  the  navigator  has  perfect  sensors, 
there  is  no  need  to  define  what  happens  when  the  navigator 
runs  off  the  "end  of  the  world"  while  seeking  a  bogus  POD. 


6  SUMMARY  OF  ASSUMPTIONS  AND  EXAMPLE 

Having  established  nearly  all  the  definitions  that  are 
necessary,  we  review  the  assumptions  we  have  made,  and 
exercise  our  vocabulary  in  a  simple  example. 

The  world  is  one-dimensional  and  static,  with  obvious 
objects  that  can  be  represented  as  points.  It  is  distinguished 
from  the  sequence  of  discrete-valued  s-vectors  that  the 
omniscient  map-maker  has  abstracted  it  into,  and  from  the 
custom  map  that  the  primitive  navigator  receives.  There  are 
no  sensory  or  motor  errors  of  any  kind,  and  sensors  have 
effectively  a  range  of  zero.  Sensors  are  independent  of  each 
other,  and  can  be  fired  in  a  purely  parallel  or  purely  sequential 
manner.  We  ignore  any  motor  cost,  concentrating  on  issues  of 
topology.  The  custom  map  is  made  from  simple  directions, 
consisting  of  only  a  heading  and  a  POD.  The  navigator  and 
map-maker  somehow  are  aware  of  the  actual  start  and  the 
desired  goal,  and  the  navigator  has  effective  procedures  for 
determining  if  a  navigation  has  been  successful. 

Now  consider  the  following  world  model  of  a  Lineland, 
which  is  of  length  10  and  in  which  objects  are  described  by  a 
single feature (intuitively, color) taken from the domain {r, g, b}:
the  sequence  is  <b,g,r,r,r,b,b,b,g,r>.  Since  there  is  only  one 
sensor,  the  parallel  and  sequential  sensor  models  are 
equivalent  and  we  dispense  with  as  much  syntactic  overhead 
in  the  PODs  as  possible.  The  custom  map  maker  can 
therefore  give  directions  of  the  simple  form:  +r,  or  -b. 

The  navigator  asks  for  a  custom  map  from  2  to  8.  The 
custom  map  maker  replies:  (2,  +g-b,  8).  The  navigator 
traverses the map by (re)setting the "I am here" integer to 2,
repeatedly incrementing it while attempting to attain a match to
the sensation of g, and first attaining a true match when "I am
here" equals 9. Reversing direction, a match to b occurs when
"I am here" equals 8. The directions being exhausted, the
traversal  is  successful  because  "I  am  here"  equals  goal. 

Note  that  the  custom  map-maker  could  also  have  said 
(2, +b+b+b, 8), which is a longer sequence of directions but
still correct. Or even (2, +r+r+g-b, 8), which is longer still,
correct,  but  in  some  sense  redundant. 
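
The traversal semantics can be checked mechanically against this example. The sketch below assumes the single-sensor Lineland above, so a POD is just one symbol. The helper names `traverse` and `parse` are hypothetical, and returning failure when the navigator runs off the end of the world is an assumption, since the paper leaves that case undefined.

```python
def traverse(world, custom_map):
    """Follow a single-sensor custom map (start, directions, goal).

    world      -- string of object values; position p is world[p - 1]
    directions -- list of (heading, value), heading being '+' or '-'
    Returns (success, sensor_cost), where success is the final
    comparison of "I am here" with goal.
    """
    start, directions, goal = custom_map
    here = start                      # the "I am here" integer
    cost = 0
    for heading, value in directions:
        step = 1 if heading == "+" else -1
        while True:
            here += step              # move to the next object
            if not 1 <= here <= len(world):
                return False, cost    # assumption: off the end = failure
            cost += 1                 # one sensor firing per object
            if world[here - 1] == value:
                break                 # match: take the next direction
    return here == goal, cost


def parse(text):
    """Turn '+g-b' into [('+', 'g'), ('-', 'b')]."""
    return [(text[k], text[k + 1]) for k in range(0, len(text), 2)]


world = "bgrrrbbbgr"
for text in ("+g-b", "+b+b+b", "+r+r+g-b"):
    print(text, traverse(world, (2, parse(text), 8)))
# +g-b (True, 8)
# +b+b+b (True, 6)
# +r+r+g-b (True, 8)
```

All three maps reach index 8. Under this sketch's count of one sensor firing per object visited, the three-direction map fires fewer sensors than the two-direction map, which illustrates that map length and sensor work are distinct criteria.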


7  OPTIMAL  UNIT  DIRECTIONS 

We  would  like  to  make  the  idea  of  "efficient"  and  "best" 
map  precise.  Unfortunately,  we  first  need  some  additional 
definitions  and  representations  to  investigate  how  difficult 
these  questions  are. 


7.1  TRANSITION  AND  COST  GRAPHS 

We  introduce  two  abstract  data  types,  the  transition 
graph  and  the  cost  graph.  They  are  abstract  data  types 
because,  depending  on  the  sensor  model  and  the  map  criteria 
being  optimized  (and,  in  some  cases,  on  the  contents  of  the 
world  model  itself),  they  are  realized  as  various  data 
structures. Thus, in some cases they are best implemented
as  arrays,  in  others  as  lists.  In  some  cases  they  are  best 
established  before  the  optimizing  map-making  algorithm,  in 
others  they  are  incrementally  established  cooperatively  with  it, 
and  in  still  others  they  are  never  established  at  all,  but  are 
directly  incorporated  into  the  algorithm  itself. 

Intuitively,  the  graphs  record  information  about  the  most 
efficient  unit  direction,  if  any,  that  takes  a  navigator  from  any 
given  sequence  index  in  the  Lineland  world  model  to  any  other 
index.  If  there  is  such  a  unit  direction,  the  transition  graph 
records  it,  and  the  cost  graph  records  its  cost  (which  can  be 
defined  in  multiple  ways).  Both  graphs  conceptually  contain  a 
special  sentinel  wherever  a  unit  transition  is  impossible;  in 
particular,  there  is  no  unit  direction  that  can  cause  the 
navigator  to  remain  in  the  same  place. 

The  problem  of  attaining  an  efficient  custom  map  from  a 
start  to  a  goal  can  then  be  broken  down  (conceptually,  at 
least)  into  the  establishment  of  efficient  unit  directions, 
followed  by  the  traversal  of  the  unit  direction  graph  from  start 
to  goal.  There  are  times,  of  course,  when  this  is  grossly 
inefficient,  such  as  when  the  navigator  is  traveling  a  very  short 
distance  and  a  type  of  search  is  more  appropriate;  we  do  not 
thoroughly  analyze  such  tradeoffs  here. 

Intuitively,  the  graphs  are  filled  in  in  two  stages. 

7.1.1  FIRST  STAGE;  DEFINITION  OF  "LANDMARK" 

In  the  first  stage  the  transition  graph  is  filled  with 
tentative  unit  directions  of  the  most  inefficient  kind,  namely  the 
full  sensory  description  of  the  unit  direction's  goal.  For 
example, if the Lineland world model is <<r,x>, <g,x>, <r,y>>,
then the unit transition connecting indices 1 and 3 would
initially be given as +<r,y> (under the sequential model,
assuming disjoint sensor domains), even though either +r or
+y would suffice. This first stage is straightforward and easy:
starting at each position j, the abstract model sequence is
scanned backwards, and as long as no object matching the
description of object_j is found, the unit directive given by
(loosely speaking) "+<object_j's full description>" is entered at
transition[i,j]. More accurately, what is entered at transition[i,j]
is +{<1,v_1>, <2,v_2>, ..., <S,v_S>} under the parallel model, and
+<<1,v_1>, <2,v_2>, ..., <S,v_S>> under the sequential model,
where v_k is the value in slot k of object_j. Lineland is scanned
similarly in the opposite direction for directions beginning with
the opposite heading.

As  an  example,  if  the  Lineland  model  is  again 
<b,g,r,r,r,b,b,b,g,r>,  then  the  transition  graph  looks  like  the 
below,  where  empty  entries  are  considered  to  be  filled  with 
sentinel  values. 

In some special cases, there may even be more efficient
ways  of  structuring  the  abstract  transition  graph.  For  example, 
in  the  above  case  the  graph  is  quite  sparse;  since  no  row  can 
have  more  than  six  entries  (two  headings  times  three  object 
descriptions),  the  graph  can  be  compressed  using  various 
techniques.  In  general,  if  navigation  is  in-the-large,  by 
definition  there  will  be  a  large  number  of  objects  with  the  same 
object  descriptions,  and  the  transition  graph  will  always  be 
sparse  and  compressible.  In  fact,  the  transition  graph  at  this 
stage  of  processing  can  be  considered  purely  binary,  since 
both  the  heading  and  the  object  description  are  implicit  in  the 
structure  of  the  graph  and  can  be  fully  recovered  from  the 
coordinates  of  any  entry  that  is  not  a  sentinel. 

We note in passing that one definition of a "landmark"
would  be  an  object  whose  column  in  the  transition  graph  is 
very  nearly  filled,  that  is,  an  object  that  can  be  obtained  under 
a  unit  direction  from  very  nearly  everywhere.  In  the  above 
example,  either  G  object  is  a  fairly  good  landmark.  One  can 
easily  define  a  concept  such  as  the  "landmark  radius"  of  an 
object  in  the  obvious  way.  Unique  objects  would  then  have 
landmark  radii  of  at  least  half  of  Lineland.  One  can  navigate 
great  distances  without  even  knowing  their  indices;  one  just 
remembers  their  (nearly)  unique  description. 
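
The first-stage construction and the column-fill notion of a landmark can be sketched for the ten-object example. This is an illustrative sketch only: the function name `first_stage`, the dictionary representation (absent keys standing in for sentinels), and the use of a raw column count as a landmark score are assumptions, not the paper's data structures.

```python
def first_stage(world):
    """Return {(i, j): direction} for every unit direction that takes
    the navigator from index i to index j (1-based). Missing keys play
    the role of the sentinel. With one sensor, the full description of
    object_j is just its single symbol."""
    n = len(world)
    transition = {}
    for j in range(1, n + 1):
        target = world[j - 1]
        # Scan backwards from j: '+target' reaches j from each i < j,
        # up to and including the nearest earlier look-alike object.
        for i in range(j - 1, 0, -1):
            transition[(i, j)] = "+" + target
            if world[i - 1] == target:
                break
        # Scan forwards from j for the opposite heading.
        for i in range(j + 1, n + 1):
            transition[(i, j)] = "-" + target
            if world[i - 1] == target:
                break
    return transition


world = "bgrrrbbbgr"
graph = first_stage(world)

# Column fill: from how many indices can object j be reached in one step?
fill = {j: sum(1 for (_, col) in graph if col == j)
        for j in range(1, len(world) + 1)}
best = max(fill.values())
landmarks = [j for j in fill if fill[j] == best]
print(landmarks, best)   # [2, 9] 8
```

Both g objects are reachable by a unit direction from eight of the other nine positions, which agrees with the remark that either one is a fairly good landmark.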

Because  landmarks  can  be  used  to  navigate  over  great 
numbers  of  objects,  they  will  tend  to  appear  regularly  in 
custom  maps.  Note  that  landmarks  need  not  be  "reversible”; 
for  example,  the  fifth  object  in  the  above  example  is  the  third  of 
three  R's  in  a  row,  and  is  only  valuable  as  a  landmark  while 
navigating left. This is often the cause of distress in human
map-making; in general custom maps are not reversible. The
first  instance  of  any  object  in  a  long  row  of  like  objects  is  often 
valuable  as  a  unit  direction,  but  it  becomes  the  very  last  object 
in  the  row  when  considered  from  the  opposite  heading  and  is 
nearly  useless. 

7.1.2  SECOND  STAGE 

In  the  second  stage  of  transition  and  cost  graph 
construction,  the  transition  graph  is  optimized  under  the 
appropriate  sensor  model,  and  the  cost  of  the  optimal  unit 
direction  is  recorded  in  the  cost  graph;  sentinel  values  in  the 
transition graph give rise to infinities in the cost graph. In the
example  above,  since  there  is  only  one  sensor,  there  is  no 
need  to  optimize,  and  the  cost  graph  is  trivially  derived  from 
the  transitions. 

But  in  general,  the  second  stage  is  more  complex, 
conceptually  and  computationally,  for  several  reasons.  First, 
there  may  be  more  than  one  optimal  sensor  subset  or 
subsequence  that  attains  the  optimal  cost  (as  in  the  example 
of  the  +r  vs.  +y  above);  depending  on  the  overall  model  of  map 
goodness they may all have to be recorded. (However, in
none of our cases is this necessary.) Secondly, the
optimization of the entire graph can probably be done in a
more efficient way than by optimizing each entry separately;
this  remains  to  be  demonstrated.  But  third  and  most 
importantly,  under  many  sensor  synchronicity  models  the 
optimization  of  even  a  single  unit  transition  is  NP-complete  in 
the  number  of  sensors.  We  prove  it  to  be  so  for  the  cases  of 
purely  parallel  and  purely  serial  sensing. 


7.2  THE  COMPLEXITY  OF  PARALLEL  SENSING 

We  begin  with  an  example.  Suppose  that  the  navigator 
is  equipped  with  three  sensors  (S  =  3),  and  a  portion  of  the 
transition graph has determined that the unit direction
consisting of "+<object_j's full description>" takes object_i to
object_j. For simplicity, we assume that each of the sensors is
binary valued, and that object_j's full description as a 3-vector is
<1,1,1>. (Alternatively, the binary 1 can be thought of as
indicating  that  the  given  sensor  slot  has  successfully  matched 
its  value,  whatever  it  may  be;  conversely,  a  binary  0  indicates 
value  failure  in  the  slot.) 

Consider the following subsequence of the Lineland
model, which indicates the 3-vectors of sensed values for
object_i and the seven objects that follow it, through object_j
inclusive; the value "X" represents either a 0 or a 1.

      i             j

s1    X 0 0 0 0 1 1 1
s2    X 0 1 0 1 0 1 1
s3    X 1 0 1 0 1 0 1

Under  the  syntax  of  directions  assuming  fully  parallel 
sensor  synchrony,  the  proper  unit  direction  at  transition[i,j]  = 
+{(1,1),  (2,1),  (3,1)};  by  our  sensor  cost  function,  the  amount  of 
sensor  work  is  S|j-i|,  in  this  case,  3*7  =  21.  The  question 
arises  if  any  subset  of  these  three  sensors  is  also  a  valid  unit 
direction  at  less  cost  (clearly,  at  cost  of  either  14  or  7). 

Since  S  is  relatively  small,  it  is  not  hard  to  examine  the 
eight  subsets  exhaustively;  most  are  trivially  inadequate.  It  is 
apparent that +{(2,1), (3,1)} is a valid unit direction at cost 2*7
=  14,  and  that  it  is  the  only  cheaper  valid  unit  direction.  As  a 
general  rule,  any  sensor  subset  is  subject  to  the  constraint 
that,  for  each  of  the  interior  objects  in  the  range  i  to  j  exclusive, 
there  must  be  at  least  one  sensor  that  provides  a  value  of  0. 
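
The exhaustive check of the eight subsets is small enough to write out. The sketch below encodes only the six interior columns of the table above. The names `INTERIOR`, `STEPS`, and `is_valid` are illustrative assumptions.

```python
from itertools import combinations

# Rows are sensors s1..s3; columns are the six interior objects, that
# is, the columns strictly between object_i and object_j in the table.
# 1 = the reading matches object_j's value, 0 = mismatch.
INTERIOR = {
    1: [0, 0, 0, 0, 1, 1],
    2: [0, 1, 0, 1, 0, 1],
    3: [1, 0, 1, 0, 1, 0],
}
N_INTERIOR = 6
STEPS = 7  # |j - i|: objects visited while executing the unit direction


def is_valid(subset):
    """Every interior object must be rejected (0) by some chosen sensor."""
    return all(
        any(INTERIOR[s][k] == 0 for s in subset)
        for k in range(N_INTERIOR)
    )


valid = [
    (subset, len(subset) * STEPS)
    for size in range(len(INTERIOR) + 1)
    for subset in combinations(sorted(INTERIOR), size)
    if is_valid(subset)
]
print(valid)   # [((2, 3), 14), ((1, 2, 3), 21)]
```

Only the pair {s2, s3} and the full set survive, confirming that 14 is the cheapest valid sensor cost for this table.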

What  is  valuable  about  the  example  is  that  it  is  the 
minimum  counterexample  to  any  anticipation  that  a  greedy 
algorithm  would  be  useful  in  selecting  a  subset.  A  greedy 
algorithm  would  build  up  the  sensor  subset  incrementally  by 
the  following  method.  From  those  sensors  which  remain, 
choose whatever sensor contributes most to detecting that
interior objects are not object_j; or, in looser words, choose
whatever sensor provides the most "new" zeros.
Following the greedy algorithm in the above example, sensors
would be chosen in the order s1 (which detects that the first
four objects after object_i cannot be object_j), followed by both s2
and s3 in either order, since either of them properly detects
that one or the other of the remaining two intermediate objects
cannot be object_j. Thus, the greedy algorithm would not
improve  on  the  initial  unit  direction. 
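
The failure of the greedy method on this table can be reproduced directly. In this sketch each sensor is represented by the set of interior objects (numbered 1 to 6) on which it reads 0. The function name `greedy_cover` and the rule of breaking ties by lowest sensor index are assumptions.

```python
def greedy_cover(zeros, universe):
    """Repeatedly pick the sensor with the most 'new' zeros.

    zeros    -- {sensor: set of interior objects it rejects}
    universe -- iterable of all interior objects
    Returns the sensors in the order chosen.
    """
    chosen, uncovered = [], set(universe)
    while uncovered:
        # max() keeps the first maximal element, so ties go to the
        # lowest-numbered sensor.
        best = max(sorted(zeros), key=lambda s: len(zeros[s] & uncovered))
        if not zeros[best] & uncovered:
            raise ValueError("no subset of sensors rejects every object")
        chosen.append(best)
        uncovered -= zeros[best]
    return chosen


# Zero positions read off the interior columns of the table above.
zeros = {1: {1, 2, 3, 4}, 2: {1, 3, 5}, 3: {2, 4, 6}}
print(greedy_cover(zeros, range(1, 7)))   # [1, 2, 3]
```

Greedy commits to s1 because it has four zeros, and is then forced to add both s2 and s3, even though s2 and s3 alone suffice.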

One  is  gradually  led  to  suspect  that  no  simple  algorithm 
for  selecting  the  sensor  subset  would  do  better  than 
exhaustive  search.  This  suspicion  is  based,  in  part,  on  the 
observation  that  the  utility  of  a  given  sensor  depends  upon 
each and every one of the other sensors selected before it, in a
non-trivial  way.  A  simple  count  of  the  number  of  mismatches 
(zeros)  in  a  sensor  does  not  indicate  how  many  of  those  zeros 
are  that  sensor’s  unique  contributions  to  the  growing 
collections  of  mismatches. 


The  problem,  in  fact,  is  NP-complete.  We  will  show  that 
it is a straightforward instance of a known NP-complete
problem, the "minimum cover" problem [5], under a
simple one-to-one transformation. The minimum cover
problem is one of the 21 "original" NP-complete problems
presented in Karp's landmark 1972 paper [7].

The  minimum  cover  problem  is  stated  as  follows:  Given 
a  collection  C  of  subsets  of  a  finite  set  T,  and  a  positive  integer 
M  <=  |C|,  does  C  contain  a  cover  for  T  of  size  M  or  less?  That 
is,  is  there  a  subcollection  C’  contained  in  the  collection  C,  with 
|C'| <= M, and such that every element of T belongs to at least
one  member  of  C’? 

All  that  is  necessary  is  to  identify  T,  C,  and  M.  Take  T  to 
be the set of interior objects, that is, object_{i+1} to object_{j-1},
inclusive. Each sensor s_k contributes one member subset to
the collection C, consisting exactly of those objects that it can
verify as not being object_j (i.e. all its "zero" objects). Take M to
be any number less than S, the number of sensors. Then the
parallel sensor problem becomes the minimum cover problem,
since we seek a subcollection of M sensors such that every
interior object is verifiable by at least one sensor as not being
object_j.
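
The identification of T and C can be made concrete with a small sketch. The helper names `to_min_cover` and `is_cover`, and the numbering of interior objects from 1, are assumptions introduced for illustration. The data are the interior columns of the three-sensor table used earlier.

```python
def to_min_cover(rows, target):
    """Build the minimum cover instance for one unit transition.

    rows   -- {sensor: list of sensed values for the interior objects}
    target -- {sensor: value that sensor reads on object_j}
    Returns (T, C): T is the set of interior objects, and C maps each
    sensor to the subset of T that it verifies as not being object_j.
    """
    n = len(next(iter(rows.values())))
    T = set(range(1, n + 1))
    C = {
        s: {k + 1 for k, value in enumerate(values) if value != target[s]}
        for s, values in rows.items()
    }
    return T, C


def is_cover(T, C, chosen):
    """True if the chosen sensors jointly reject every interior object."""
    return set().union(*(C[s] for s in chosen)) >= T


rows = {
    1: [0, 0, 0, 0, 1, 1],
    2: [0, 1, 0, 1, 0, 1],
    3: [1, 0, 1, 0, 1, 0],
}
T, C = to_min_cover(rows, target={1: 1, 2: 1, 3: 1})
print(C)                       # {1: {1, 2, 3, 4}, 2: {1, 3, 5}, 3: {2, 4, 6}}
print(is_cover(T, C, [2, 3]))  # True
print(is_cover(T, C, [1, 2]))  # False
```

With M = 2, the question of whether a cheaper unit direction exists is exactly the question of whether some two-member subcollection of C covers T.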

Of  the  additional  results  that  are  known  about  the 
minimum  cover  problem,  a  few  are  relevant  here.  The 
problem  is  known  to  be  NP-complete  even  if  (translating  to 
sensor  terminology)  each  sensor  rejects  three  or  fewer  interior 
objects.  In  fact,  the  problem  becomes  tractable  only  if  each 
sensor  rejects  no  more  than  two  interior  objects,  for  any 
interior.  Although  in  some  cases  such  sensors  may  be  built 
(for example, sensor k may be designed to reject object_k and
object_{k+1}), this would require the map-maker to identify a
virtually  unique  feature  for  each  object.  While  not  quite  as 
extreme  as  giving  all  objects  house  numbers,  it  would  still 
require  the  number  of  sensors  to  grow  linearly  with  the  number 
of  objects. 

Having  shown  the  optimization  of  unit  directions  under 
parallel  sensing  to  be  NP-complete,  what  devices  are  there 
other  than  exhaustive  enumeration  that  can  find  a  solution?  It 
appears that the A* algorithm [10], a branch-and-bound
technique  coupled  with  dynamic  programming  and  the  use  of 
heuristic  underestimates,  is  applicable  here.  Again,  all  that  is 
necessary  is  to  map  this  problem  into  that  formalism. 

The branch-and-bound part is as follows. Our goal is a
covering  subset  of  sensors  with  least  cardinality,  so  we  will 
minimize,  as  our  A*  cost,  the  subset  size.  The  search 
naturally  begins  with  singleton  subsets,  and  branches  at  each 
step  by  extending  each  candidate  subset  with  all  possible 
individual sensors not yet included in it. Bounding (the
elimination of infeasible solutions) naturally occurs when a
subset  grows  by  a  useless  sensor,  that  is,  by  one  that  does 
not  eliminate  any  additional  objects  (i.e.  one  with  no  "new" 
zeros). 

The  dynamic  programming  part  derives  from  the 
observation  that  order  is  unimportant  in  a  subset,  so  candidate 
subsets  (such  as  {2,3,1})  that  have  already  been  considered 
via  a  different  developmental  history  (such  as  {1,2,3})  are 
eliminated. 

There  does  not  seem  to  be  any  underestimating 
heuristic  function,  however.  This  is  because  any  subset  can 
potentially  become  a  minimum  cover  in  just  one  more  A*  step 
by  the  addition  of  just  the  right  sensor,  and  there  is  no 
information  in  the  subset  that  can  predict  or  deny  that 
possibility.  The  true  heuristic  underestimate  is  thus  uniformly 
zero,  and  the  search  is  basically  breadth  first  (all  singleton 
subsets,  then  all  doublet  subsets,  etc.),  augmented  a  bit  by  the 
efficiencies  of  bounding  and  of  dynamic  programming.  The 
principal advantage over blind enumeration is that subsets
with  useless  sensors  are  never  extended. 
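
The search just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `rejected_objects` and `minimum_cover` are hypothetical, each sensor is assumed to be given as a string of 0s and 1s (a 0 meaning the sensor rejects that object, the last character being the goal object j), and the three rows are borrowed from the first table of Section 7.3 with the leading X dropped.

```python
from typing import Dict, FrozenSet, Optional


def rejected_objects(row: str) -> FrozenSet[int]:
    """Indices of interior objects that a sensor rejects ('0' entries).

    The last character of the row is the goal object j and is skipped.
    """
    return frozenset(k for k, bit in enumerate(row[:-1]) if bit == "0")


def minimum_cover(rows: Dict[str, str]) -> Optional[FrozenSet[str]]:
    """Breadth-first search for a smallest covering subset of sensors."""
    rejects = {name: rejected_objects(row) for name, row in rows.items()}
    n_interior = len(next(iter(rows.values()))) - 1
    interior = frozenset(range(n_interior))
    if not interior:
        return frozenset()
    # Each frontier entry maps a sensor subset to the objects it eliminates.
    frontier = {frozenset(): frozenset()}
    while frontier:
        next_frontier = {}
        for subset, covered in frontier.items():
            for name, rej in rejects.items():
                if name in subset:
                    continue
                grown_cover = covered | rej
                if grown_cover == covered:
                    continue  # bounding: the sensor adds no "new" zeros
                grown = subset | {name}
                if grown_cover == interior:
                    return grown
                # Keying by frozenset merges {1,2} and {2,1}: the
                # "dynamic programming" step of the text.
                next_frontier[grown] = grown_cover
        frontier = next_frontier
    return None  # no subset of sensors eliminates every interior object


ROWS = {"s1": "00000011111", "s2": "01010101011", "s3": "10101010101"}
print(sorted(minimum_cover(ROWS)))  # ['s2', 's3']
```

Because every level of the frontier holds subsets of one size, the first complete cover found has least cardinality; the zero heuristic is what makes the search level-by-level.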


7.3  THE  COMPLEXITY  OF  SEQUENTIAL  SENSING 

It  is  not  hard  to  show  that  the  case  of  sequential  sensing 
is  also  NP-complete.  It  is  clear  that  it  is  in  NP  since  it  is 
straightforward  to  verify  any  solution.  We  now  show  it  is  NP- 
complete  because  it  contains  parallel  sensing,  shown  to  be 
NP-complete,  as  a  special  case. 

Let  us  first  rephrase  the  sequential  sensing  problem  in 
the  following  way.  Given  a  cost,  can  we  select  a  subset  of 
sensors  of  size  U,  such  that  some  permutation  of  the  sensors 
achieves  the  goal  of  eliminating  all  interior  objects  (according 
to  the  semantics  of  the  sequential  match  routine),  at  or  below 
the  cost?  Now,  since  we  can  set  the  cost  arbitrarily,  let  us  set 
it  to  be  S|j-i|,  which  is  clearly  the  maximum  cost  possible. 
Since  the  cost  is  allowed  to  be  so  high,  we  need  not  worry  at 
all  about  permutations;  if  the  subset  solves  the  problem  at  all, 
any  of  its  permutations  will  also  meet  the  cost.  Thus,  the 
problem  reduces  to  finding  a  subset  of  size  U  that  eliminates 
all  interior  objects;  this  is  the  parallel  sensing  case  revisited. 

Among  other  things,  this  again  indicates  that  greedy 
algorithms  or  sorting  algorithms  will  fail,  sometimes 
dramatically.  The  following  is  again  a  minimal 
counterexample. 

      i          j

s1    X00000011111
s2    X01010101011
s3    X10101010101

A greedy algorithm (here, the equivalent of a sorting
algorithm) would suggest the unit direction +<<1,1>, <2,1>,
<3,1>>, that is, the sensors are to be tested in order.
According to the "match" routine for sequential sensors, the
sensor cost for each of the objects in order is respectively
1,1,1,1,1,1,2,3,2,3,3, for a total of 19, 8 more than the theoretic
minimum of |j-i| = 11. However, the unit direction +<<2,1>,
<3,1>> has sensor costs of 1,2,1,2,1,2,1,2,1,2,2, for a total of
17. As can be verified by exhaustive enumeration of the 15
possible non-empty subsequences, this is a global optimum.

Further  investigation  shows  more  of  the  difficulty  of  the 
problem:  total  sensor  cost  is  not  necessarily  monotonic  with 
sensor  count.  That  is,  the  most  efficient  sensor  subsequence 
need  not  be  the  one  that  requires  the  least  number  of  sensors. 
Consider  the  following  minimal  example: 

      i          j

s1    X00000000111
s2    X00001111011
s3    X11110000101

By inspection, the unit direction +<<2,1>, <3,1>> has
sensor cost 1+1+1+1+2+2+2+2+1+2+2 = 17. However, the
unit direction +<<1,1>, <2,1>, <3,1>> has sensor cost
1+1+1+1+1+1+1+1+2+3+3 = 16. Given the complexity of the
problem, such anomalies should not be surprising.
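
Both counterexamples can be checked mechanically. The sketch below assumes, from the per-object costs quoted above, that the sequential "match" routine tests the sensors of a unit direction in order at each object and stops at the first sensor that rejects it (a 0 in the table), so that the goal object costs one test per sensor. The function names are hypothetical and the leading X of each row is dropped.

```python
from itertools import permutations


def sequential_cost(order, table):
    """Total number of sensor tests when `order` is applied at every object."""
    n_objects = len(table[order[0]])
    total = 0
    for pos in range(n_objects):
        for tests, name in enumerate(order, start=1):
            if table[name][pos] == "0":
                break  # this sensor rejects the object; stop testing
        total += tests
    return total


def covers(order, table):
    """True if every interior object (all but the last) is rejected."""
    n_interior = len(table[order[0]]) - 1
    return all(any(table[s][p] == "0" for s in order)
               for p in range(n_interior))


def best_subsequence(table):
    """Exhaustive search over all non-empty ordered subsequences."""
    names = sorted(table)
    candidates = [seq
                  for k in range(1, len(names) + 1)
                  for seq in permutations(names, k)
                  if covers(seq, table)]
    return min((sequential_cost(seq, table), seq) for seq in candidates)


FIRST = {"s1": "00000011111", "s2": "01010101011", "s3": "10101010101"}
SECOND = {"s1": "00000000111", "s2": "00001111011", "s3": "11110000101"}

print(sequential_cost(("s1", "s2", "s3"), FIRST))   # 19
print(sequential_cost(("s2", "s3"), FIRST))         # 17
print(best_subsequence(FIRST))                      # (17, ('s2', 's3'))
print(sequential_cost(("s2", "s3"), SECOND))        # 17
print(best_subsequence(SECOND))                     # (16, ('s1', 's2', 's3'))
```

In the second table the cheapest subsequence uses all three sensors, which is the non-monotonic behavior noted in the text.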

The  application  of  the  A*  algorithm  to  this  problem  is 
somewhat  more  complex  than  in  the  parallel  sensor  case,  in 
part  because  an  underestimating  heuristic  exists.  Again,  we 
identify  the  branch-and-bound,  dynamic  programming,  and 
heuristic  underestimate  components. 

The  branch-and-bound  part  is  as  follows.  Our  goal  is  a 
covering  subsequence  of  sensors  with  least  sensor  cost.  The 
A*  cost  we  will  minimize  then  is  solely  sensor  cost,  since  it  is 
unrelated  to  subsequence  length,  as  we  have  shown  above. 
The search naturally begins with singleton subsequences, and
branches at each step by extending each candidate
subsequence with all possible individual sensors not yet
included in it. Bounding again naturally occurs when a
subsequence grows by a useless sensor, that is, by one that
does not eliminate any additional objects.

The  dynamic  programming  part  derives  from  the  curious 
observation  that  to  some  extent  the  subsequences  behave  as 
subsets,  and  so  can  be  combined  as  the  subsets  were  in  the 
parallel  sensor  A*  algorithm.  Although  all  the  permutations  of 
a  subsequence  may  have  sensor  costs  that  differ  because  of 
the  order  of  the  sensing,  it  is  only  necessary  to  keep  the 
cheapest one. This is because, regardless of permutation,
any sensor subsequence eliminates exactly the same subset
of interior objects; it is only the remaining subset of interior
objects that any newly added sensor can "see".

There  is  a  natural  underestimating  heuristic  function, 
other  than  zero.  At  any  given  point,  a  developing 
subsequence  can  determine  exactly  how  many  interior  objects 
it  has  yet  to  eliminate.  At  best,  the  next  sensor  added  to  the 
sequence  will  eliminate  them  all;  more  likely  it  will  only 
eliminate  a  few.  Thus,  uneliminated  object  count  is  a  true 
underestimate  of  future  sensor  cost. 
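
The three components can be combined in a short A* sketch. It rests on assumptions that go beyond the text: the match semantics are those inferred from the examples above (test sensors in order, stop at the first rejection), so appending one sensor to a subsequence adds one test for every object not yet eliminated plus one for the goal; the search starts from the empty subsequence rather than from singletons; and the function name is hypothetical.

```python
import heapq
from itertools import count


def astar_sequential(table):
    """Return (sensor cost, subsequence) of a cheapest covering subsequence.

    table[name] is a string of '0'/'1' per object, goal object last.
    Assumes at least one interior object.
    """
    names = sorted(table)
    n_interior = len(table[names[0]]) - 1
    rejects = {s: frozenset(p for p in range(n_interior)
                            if table[s][p] == "0") for s in names}
    everything = frozenset(range(n_interior))
    tie = count()  # tie-breaker so heap entries never compare sets
    heap = [(len(everything), next(tie), 0, (), everything)]
    best_g = {frozenset(): 0}  # cheapest known cost per sensor *subset*
    while heap:
        _, _, g, seq, remaining = heapq.heappop(heap)
        if not remaining:
            return g, seq
        if g > best_g.get(frozenset(seq), float("inf")):
            continue  # a cheaper permutation of this subset was found
        for s in names:
            if s in seq:
                continue
            new_remaining = remaining - rejects[s]
            if new_remaining == remaining:
                continue  # bounding: useless sensor
            # One more test at each uneliminated object and at the goal.
            new_g = g + len(remaining) + 1
            key = frozenset(seq) | {s}
            if new_g >= best_g.get(key, float("inf")):
                continue  # dynamic programming: keep the cheapest order
            best_g[key] = new_g
            # Heuristic underestimate: count of uneliminated objects.
            f = new_g + len(new_remaining)
            heapq.heappush(heap, (f, next(tie), new_g, seq + (s,),
                                  new_remaining))
    return None


SECOND = {"s1": "00000000111", "s2": "00001111011", "s3": "11110000101"}
print(astar_sequential(SECOND))  # (16, ('s1', 's2', 's3'))
```

Under these assumptions the heuristic never overestimates, since every uneliminated object requires at least one further test.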


7.4  EXAMPLE  OF  OPTIMIZED  TRANSITION  GRAPH 

Suppose  we  complicated  our  example  Lineland  a  bit  by 
adding  a  second  sensor  (intuitively,  a  shape  sensor),  whose 
domain was simply (x,y). Suppose the abstract world now
looked like <bx,gy,rx,rx,ry,bx,bx,bx,gy,rx>. (Since the domains
are disjoint, we have eliminated the inner level of brackets, i.e.,
the model really is <<b,x>,<g,y>,...>.) Applying our
optimization under the sequential sensor model, we find the
following transition graph, where notation has also been
simplified in the obvious way: e.g., "+yr" is the equivalent of
"+<<2,y>,<1,r>>", and the order of each entry is significant.

8  FOUR  MEASURES  OF  MAP  QUALITY

We  are  now  prepared  to  define  map  quality  according  to 
two  independent  criteria,  and  give  algorithms  that  produce 
optimal  custom  maps  according  to  these  criteria.  Since  we 
have  long  since  dismissed  time  as  a  valid  measure  of 
goodness,  both  in  terms  of  the  time  cost  of  sensing  and  the 
time  cost  of  motoring,  and  since  no  operation  ever  has  any 
error,  there  are  only  two  other  measures  left  to  minimize. 
They  are  the  length  of  the  map,  and  the  total  sensor  cost.  A 
custom  map  is  "good"  if,  among  all  such  maps  that  attain  the 
goal  from  the  starting  position,  it  does  so  with  the  least  amount 
of  directions  or  the  least  amount  of  sensing.  The  criteria  can 
of  course  be  mixed.  Of  all  the  shortest  maps,  we  can  further 
select  the  blindest,  and  of  all  the  blindest  maps,  we  can  speak 
of the shortest; usually they will not be the same map. Taken
together these criteria roughly capture what a human would
call the cognitive complexity of a map.

We  wish,  then,  to  investigate  algorithms  for  eight  cases 
of optimal map-making, four each under the two sensor
models.  The  four  criteria  to  minimize  are:  direction  length, 
sensor  cost,  sensor  cost  primary  and  direction  length 
secondary,  and  the  reverse.  In  most  of  these  cases,  the 
solution  is  given  by  Dijkstra’s  shortest  path  algorithm,  although 
what  critically  differentiates  them  is  how  the  cost  graph  is 
defined. 


8.1  PARALLEL  SENSORS,  MINIMIZING  DIRECTION 
LENGTH 

First  we  specify  what  is  meant  by  direction  length. 
Either  of  two  definitions  apply.  In  the  simplest  case,  it  is 
merely  the  count  of  the  number  of  unit  directions  in  the  custom 
map;  whether  the  subset  in  each  POD  is  short  or  long  does 
not  matter.  A  more  sophisticated  version  holds  that  what 
matters  is  the  total  length  of  the  custom  map,  and  so  each  unit 
direction  adds  a  cost  proportional  to  the  size  of  its  subset.  In 
either  case,  wherever  there  is  a  unit  direction  in  a  slot  of  the 
transition  graph,  the  cost  graph  is  assigned  the  appropriate 
value  (either  1  or  U,  the  subset  cardinality). 

The  solution  is  now  straightforward:  Apply  Dijkstra's 
shortest-path  algorithm  to  the  cost  graph,  and  read  off  the  unit 
directions  from  the  transition  graph.  This  yields  one  provably 
optimal custom map; by the appropriate backtracking, all the
other  maps  that  are  equally  good  can  be  recovered. 

For  example,  if  Lineland  is  modeled  as  the  sequence 
discussed  above,  <b,g,r,r,r,b,b,b,g,r>,  and  the  parallel- 
sensored  navigator  desires  a  custom  map  with  shortest 
directions  between  2  and  8  then  the  algorithm  yields  +g-b. 

8.2  PARALLEL  SENSORS,  MINIMIZING  SENSOR 
COST 

In  this  case,  wherever  there  is  a  unit  direction  in  the 
transition  graph,  the  cost  graph  is  assigned  N  =  U|j-i|,  that  is, 
the  sensor  cost  of  the  transition  under  the  parallel  sensor 
model,  which  is  the  product  of  the  cardinality  of  the  sensor 
subset, U, with the distance between i and j. Dijkstra applies
as before. Note that under this definition of map quality there
is  never  any  occasion  to  generate  a  map  that  overshoots  the 
goal,  as  the  goal  would  then  be  sensed  at  least  twice. 

In  the  prior  example,  one  optimal  custom  map  is 
+r+r+r+b+b+b.  In  fact,  even  with  multiple  sensors,  one  optimal 
custom  map  will  always  consist  of  |j-i|  unit  directions,  with  each 
unit  direction  consisting  of  a  single  sensor  and  value.  This 
corresponds  to  the  trivial  custom  map  that  enumerates  each  of 
the  objects  along  the  path;  in  effect,  it  measures  the  distance 
in  unary  notation.  Note,  however,  that  +b+b+b  also  is  an 
optimal  custom  map  here,  and  is  shorter  in  terms  of  the 
number  of  directions. 

This  leads  us  to  consider  the  following. 


8.3  PARALLEL  SENSORS,  MINIMIZING  DIRECTION 
LENGTH  WITHIN  SENSOR  COST 

The  cost  graph  in  this  case  is  redefined  to  carry  an 
ordered  pair  consisting  of  the  sensory  effort  in  the  first  slot  and 
the  direction  length  in  the  second.  That  is,  cost[i,j]  =  (N,1)  or 
(N,U).  During  Dijkstra  both  costs  are  accumulated  in  parallel. 
However, any cost comparisons are now done
lexicographically,  that  is,  sensory  effort  is  compared  first,  and 
any  ties  are  broken  by  a  comparison  of  direction  length.  The 
use  of  an  ordered  pair  is  cleaner  than  combining  the  costs  by 
some  weighting  function  (which  of  course  can  be  used 
instead). 


For  example,  under  this  definition  of  map  quality,  in  the 
Lineland  model  of  <b,g,r,r,r,b,b,b,g,r>  the  best  custom  map 
from  2  to  8  is  given  by  +b+b+b,  with  a  total  cost  of  (6,3).  In 
this  case,  it  happens  to  be  a  unique  solution,  although  it  need 
not  be  in  general. 

Finally,  for  completeness  we  also  consider: 


8.4  PARALLEL  SENSORS,  MINIMIZING  SENSOR 
COST  WITHIN  DIRECTION  LENGTH 

We  simply  reverse  the  cost  definition  to  be  (1,N)  or 
(U,N);  thus,  in  general,  the  order  of  the  vector  reflects  the 
order  of  the  nesting  of  the  objectives.  The  solution,  as 
expected,  is  not  necessarily  the  same  as  in  any  of  the  other 
cases. For example, for <b,g,r,r,r,b,b,b,g,r> the best custom
map  from  2  to  8  is  given  by  +g-b  at  a  cost  of  (2,8),  which 
happens  to  be  the  unique  solution,  with  less  cost  than  its 
nearest  competitor,  +r-b  at  a  cost  of  (2,10). 
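
The nested criteria of Sections 8.3 and 8.4 can be illustrated with Dijkstra's algorithm over ordered-pair costs, which Python tuples compare lexicographically. The sketch makes a simplifying assumption about the transition graph that is not taken from the paper: with a single sensor, a unit direction "+v" (or "-v") leads from position i to the nearest object with value v ahead of (or behind) i, at sensor cost |j-i| and direction length 1. The names `transitions` and `best_map` are hypothetical. This simplified rule reproduces the +b+b+b and +g-b results quoted above, but it is not claimed to reproduce every entry of the paper's transition graph.

```python
import heapq

WORLD = "bgrrrbbbgr"  # Lineland <b,g,r,r,r,b,b,b,g,r>, positions 1..10


def transitions(world, i):
    """Yield (unit direction, destination) pairs available at position i."""
    ahead = range(i + 1, len(world) + 1)
    behind = range(i - 1, 0, -1)
    for sign, positions in (("+", ahead), ("-", behind)):
        seen = set()
        for j in positions:
            value = world[j - 1]
            if value not in seen:  # only the nearest object of each value
                seen.add(value)
                yield sign + value, j


def best_map(world, start, goal, sensor_first=True):
    """Dijkstra with lexicographically ordered pair costs."""
    heap = [((0, 0), start, "")]
    done = set()
    while heap:
        cost, i, path = heapq.heappop(heap)
        if i == goal:
            return cost, path
        if i in done:
            continue
        done.add(i)
        for direction, j in transitions(world, i):
            sensing, length = abs(j - i), 1
            step = (sensing, length) if sensor_first else (length, sensing)
            total = (cost[0] + step[0], cost[1] + step[1])
            heapq.heappush(heap, (total, j, path + direction))
    return None


print(best_map(WORLD, 2, 8, sensor_first=True))   # ((6, 3), '+b+b+b')
print(best_map(WORLD, 2, 8, sensor_first=False))  # ((2, 8), '+g-b')
```

Swapping the two slots of the pair is the only change needed to reverse the nesting of the objectives.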


8.5  SEQUENTIAL  SENSORS 

For  all  four  measures  of  map  quality,  the  algorithms  are 
identical  to  their  parallel  counterparts.  The  values  in  the  cost 
graph  have  the  same  structure  as  in  the  parallel  case,  but 
have  considerably  different  values  under  the  sequential 
sensing  model.  Thus,  the  results  will  also  vary.  (Since  the 
above  examples  used  only  one  sensor,  there  would  be  no 
apparent  difference  in  our  examples.)  However,  there  is  one 
subtle  point  to  make. 

If  the  measure  of  map  quality  is  simply  direction  length, 
then  again  one  has  to  decide  the  cost  of  a  unit  direction:  it  is 
either  1  or  the  length  of  the  subsequence  in  the  POD.  But 
since  it  is  possible  under  the  sequential  sensor  model  to  have 
an optimal sensor subsequence that is actually longer than a
more costly shorter subsequence (a situation impossible
under  parallel  sensing),  it  is  not  clear  whether  the  second 
definition  of  direction  length  is  consistent  for  sequential 
sensing.  This  confusion  also  persists  in  the  case  where 
sensor  cost  is  used  to  break  ties  among  equal  direction 
lengths.  In  both  cases,  the  transitions  have  already  been 
optimized  for  sensor  cost  and  may  be  unfairly  penalized  by  the 
more  sophisticated  definition.

Abstract

This paper reports progress in range image analysis for
autonomous navigation in outdoor environments. The goal of
our work is to use range data from an ERIM laser range finder
to build a three-dimensional description of the environment.
We describe techniques for building both low-level
descriptions, such as obstacle maps or terrain maps, and
higher-level descriptions using model-based object recognition.
We have integrated these techniques in the Navlab system [1].

1.  Introduction 

Research in robotics has recently focused on the field of
autonomous vehicles, that is, mobile units that can navigate
under computer control based upon sensory information.
Several components are involved in the design of such a
system. A high-level cognitive module is in charge of making
decisions based on the perceived environment and the mission
to be carried out. Sensory modules, such as video image
analysis or range data analysis, transform the sensors' output
into a compact description that can be used by the
decision-making modules. Low-level control software converts
decisions into actions performed by the hardware.

In this paper, we focus on one type of sensory component,
the analysis of range data for an autonomous vehicle
navigating in an environment with features such as trees, uneven
terrain, and man-made objects. In particular, we study the use
of the Environmental Research Institute of Michigan (ERIM)
laser range finder to perform four tasks: obstacle detection,
surface description, terrain map building, and object
recognition (Fig. 1-1). Obstacle detection is the minimum
capability required by an autonomous vehicle in order to
navigate safely. Surface description is needed when
obstacle detection is not sufficient for safe navigation, in the
case of uneven terrain for example, or when a more accurate
description of the environment is needed, e.g. for object
recognition. Terrain map building is the process by which
surface descriptions from different vantage points are
accumulated in a consistent map. Object recognition
capabilities are required when the vehicle must recognize and
locate known landmarks, e.g. a traffic sign, in order to carry
out its mission.


Figure 1-1: Range data processing

2.  Intermediate  representations  of  range  data 
2.1.  Sensor  data 

The ERIM sensor derives the range at each point by
measuring the difference of phase between a modulated laser beam
and its reflection from the scene. A two-mirror scanning
mechanism directs the beam onto the scene so that an image
of the scene is produced. In the ERIM ALV version, the field of
view is ±40° in the horizontal plane and 30° in the vertical
plane, from 15° to 45°. The resulting range image is a
64 x 256 8-bit image. The frame rate is currently two images
per second. The nominal range noise is 0.4 feet at 50 feet.
The sensor also produces reflectance images in which the
value of each pixel is the amount of light reflected by the
target. Figure 2-1 shows an example of a range image and the
corresponding reflectance image.

The ERIM sensor presents some limitations. Since only the
phase difference is measured, the range values are known
only modulo 64 feet. This causes ambiguity in the range data.
The resolution degrades rapidly as the range increases. This is
due to the divergence of the beam, which produces larger
footprints as the distance increases, and to the scanning
mechanism. Since the scanning angles are discretized using
constant increments, the density of points decreases as the
range increases. Points may be incorrectly measured at the
edges of objects due to multiple reflections of the beam. This
effect is common to all active scanning techniques and is
known as the "mixed points" problem. We have found that
applying a median filter to the original image eliminates most
mixed points.
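
As an illustration of the last point, the sketch below applies a median filter to a small range image stored as a list of rows. The window size is not stated in the text; the 3 x 3 window (radius 1), the border handling, and the function name are assumptions made for the example.

```python
from statistics import median_low


def median_filter(image, radius=1):
    """Replace each pixel by the median of its (2*radius+1)^2 neighborhood.

    Windows are clipped at the image border. median_low is used so that
    every output value is one of the measured 8-bit range values.
    """
    rows, cols = len(image), len(image[0])
    filtered = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(0, r - radius),
                                      min(rows, r + radius + 1))
                      for cc in range(max(0, c - radius),
                                      min(cols, c + radius + 1))]
            filtered[r][c] = median_low(window)
    return filtered


# A flat patch at range value 10 with one isolated "mixed point".
patch = [[10] * 5 for _ in range(5)]
patch[2][2] = 200
print(median_filter(patch)[2][2])  # 10
```

An isolated outlier never reaches the middle of the sorted window, which is why the filter removes single mixed points while leaving larger structures in place.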


(a) Range image: the darker pixels are closest


(b)  Reflectance  image. 


Figure  2-1:  An  example  of  range  and  reflectance  images 


2.2.  Vehicle  centered  coordinates 

The raw data from the ERIM scanner represent range as a
function of angles $\theta$ and $\phi$ of the two mirrors.


As shown in Figure 2-2, we assume that a coordinate frame
$(\vec{u}, \vec{v}, \vec{w})$ is attached to the scanner. We use
another coordinate frame, the "vehicle" frame
$(\vec{i}, \vec{j}, \vec{k})$, to express the measured
points so that the resulting values are vehicle centered and are
therefore independent of the sensor configuration. We can
thus derive the coordinates $(x, y, z)$ of the point measured at
pixel (row, col) with range $D$. If $\phi$ is the angle between
the plane $(\vec{u}, \vec{v})$ and the direction $\vec{d}$ of the
measured beam at pixel $(i, j)$, $\theta$ is the angle between
the plane $(\vec{u}, \vec{w})$ and the direction $\vec{d}$ of the
measured beam at pixel $(i, j)$, $\phi_v$ is the starting vertical
scanning angle, $\Delta\theta$ and $\Delta\phi$ are the angular
increments, and $\tau$ is the tilt angle of the scanner, that is,
the angle between the planes $(\vec{i}, \vec{j})$ and
$(\vec{u}, \vec{v})$ as shown in Figure 2-2, then the conversion is:


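$$\theta = (j - 128) \times \Delta\theta$$

$$\phi = \phi_v - i \times \Delta\phi$$

$$x = D\,(\cos\theta\,(\cos\tau\cos\phi - \sin\tau\sin\phi))$$

$$y = D\sin\theta$$

$$z = D\,(\cos\theta\,(\sin\tau\cos\phi + \cos\tau\sin\phi))$$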

Figure 2-3 shows the data of the image of Figure 2-1 after
conversion to vehicle coordinates as viewed in the direction
of $\vec{w}$.
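
The conversion can be written as a small function. This is a sketch under stated assumptions: the conversion is taken to have the form theta = (j - 128) * delta_theta, phi = phi_v - i * delta_phi, x = D cos(theta) cos(tau + phi), y = D sin(theta), z = D cos(theta) sin(tau + phi), with all angles in radians; the function name and the numeric values in the usage lines (increments derived from an 80 degree by 30 degree field of view over 256 x 64 pixels, a 45 degree starting angle, a 10 degree tilt) are illustrative and not taken from the paper.

```python
import math


def pixel_to_vehicle(row, col, D, delta_theta, delta_phi, phi_v, tau,
                     center_col=128):
    """Convert a range measurement D at pixel (row, col) to (x, y, z).

    All angles are in radians. The sign conventions follow the assumed
    form of the conversion given in the lead-in.
    """
    theta = (col - center_col) * delta_theta   # horizontal scan angle
    phi = phi_v - row * delta_phi              # vertical scan angle
    x = D * math.cos(theta) * (math.cos(tau) * math.cos(phi)
                               - math.sin(tau) * math.sin(phi))
    y = D * math.sin(theta)
    z = D * math.cos(theta) * (math.sin(tau) * math.cos(phi)
                               + math.cos(tau) * math.sin(phi))
    return x, y, z


# Illustrative parameters (assumptions, see lead-in).
DELTA_THETA = math.radians(80.0 / 256)
DELTA_PHI = math.radians(30.0 / 64)
x, y, z = pixel_to_vehicle(10, 128, 50.0, DELTA_THETA, DELTA_PHI,
                           phi_v=math.radians(45.0), tau=math.radians(10.0))
print(round(math.sqrt(x * x + y * y + z * z), 6))  # 50.0
```

Whatever the parameter values, the converted point stays at distance D from the origin, which gives a quick sanity check on an implementation.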


Figure  2-2:  Conversion  to  vehicle  coordinates 


Figure 2-3: Overhead view of the data of Figure 2-1


2.3.  Bucket  map 

In outdoor environments, the ground plane $(\vec{i}, \vec{j})$ has a
privileged role: the terrain can be modeled as a function
$z = f(x, y)$, x and y being the coordinates on the ground plane.
In order to take advantage of the ground plane, we used an
intermediate representation, the bucket map.

A bucket map is defined by a regular grid on a reference
horizontal plane. Each cell of the grid, or bucket, contains a
set of measured points as shown in Figure 2-4. The points
within a bucket may all be from the same image or from
several consecutive images. The size of a bucket is typically
30 cm x 30 cm.
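
A bucket map of this kind can be sketched as a dictionary from grid indices to lists of points. The class and method names are hypothetical, coordinates are assumed to be in meters, and the mean elevation per bucket is one possible summary chosen for the example rather than something the text prescribes.

```python
import math
from collections import defaultdict


class BucketMap:
    """Regular grid on the ground plane; each bucket holds measured points."""

    def __init__(self, cell_size=0.30):
        self.cell_size = cell_size          # bucket side, in meters
        self.buckets = defaultdict(list)    # (ix, iy) -> [(x, y, z), ...]

    def index(self, x, y):
        """Grid index of the bucket containing ground position (x, y)."""
        return (math.floor(x / self.cell_size),
                math.floor(y / self.cell_size))

    def insert(self, x, y, z):
        """Add one measured point, from any image, to its bucket."""
        self.buckets[self.index(x, y)].append((x, y, z))

    def mean_elevation(self, key):
        """Average z of the points in a bucket, or None if it is empty."""
        points = self.buckets.get(key)
        if not points:
            return None
        return sum(p[2] for p in points) / len(points)


bucket_map = BucketMap()
for point in [(0.05, 0.05, 1.0), (0.29, 0.10, 3.0), (0.31, 0.00, 5.0)]:
    bucket_map.insert(*point)
print(bucket_map.mean_elevation((0, 0)))  # 2.0
```

Points from consecutive images simply accumulate in the same lists once they are expressed in a common frame.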


Figure  2-4:  The  bucket  map  structure 

3.  Obstacle  detection 

The first task of range data analysis for navigation is to
report the portions of the environment that are potentially
hazardous. We must identify individual objects in the
environment that the vehicle must avoid. Most obstacle
detection algorithms combine surface discontinuities and surface
slope [2,3] to extract untraversable regions in the image using a
vehicle model [4]. Faster algorithms use a priori knowledge of
the terrain, e.g. a flat ground assumption, by computing the
difference between the range image and the expected ideal
image [5]. Since we want to be able to navigate in a variety of
environments, we chose the first approach which, although
computationally expensive, allows us to handle uneven
terrain. Specifically, we identify points in the bucket map at
which the elevation exhibits a large discontinuity and points
at which the surface slope is above a given threshold. The
first set of points corresponds to the edges of the objects; the
second lies within the surface of an object facing the sensor.
The obstacle detection algorithm is divided into four steps:

1. Detect elevation discontinuities on the bucket
map. The discontinuities are computed in 2x2
windows around each point;

2. Compute the surface normal for each bucket;

3. Detect the buckets whose surface normal makes
an angle with the vertical direction greater
than a given threshold;

4. Extract connected regions from the set of
buckets detected at step 3 so that each region
corresponds to an object. Two buckets are
connected if they do not cross the line of elevation
discontinuity.
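
The four steps above can be sketched on a grid holding one elevation per bucket. This is a simplified illustration, not the authors' algorithm: the discontinuity test compares adjacent buckets instead of 2x2 windows, the surface normal is estimated by finite differences of elevation, and the thresholds, units (meters, degrees), and function name are assumptions.

```python
import math


def detect_obstacles(elev, cell=0.30, jump=0.5, max_angle_deg=30.0):
    """Return a list of obstacle regions, each a set of (row, col) buckets."""
    rows, cols = len(elev), len(elev[0])

    # Step 1 (simplified): a discontinuity separates two adjacent buckets
    # when their elevations differ by more than `jump`.
    def discontinuity(a, b):
        return abs(elev[a[0]][a[1]] - elev[b[0]][b[1]]) > jump

    # Steps 2 and 3: the normal (-gx, -gy, 1) makes an angle
    # atan(|gradient|) with the vertical; keep buckets above the threshold.
    steep = set()
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
            c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
            gx = (elev[r][c1] - elev[r][c0]) / (cell * (c1 - c0)) if c1 > c0 else 0.0
            gy = (elev[r1][c] - elev[r0][c]) / (cell * (r1 - r0)) if r1 > r0 else 0.0
            if math.degrees(math.atan(math.hypot(gx, gy))) > max_angle_deg:
                steep.add((r, c))

    # Step 4: flood fill over steep buckets, never crossing a discontinuity.
    regions, unvisited = [], set(steep)
    while unvisited:
        seed = unvisited.pop()
        region, stack = {seed}, [seed]
        while stack:
            r, c = stack.pop()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in unvisited and not discontinuity((r, c), nb):
                    unvisited.remove(nb)
                    region.add(nb)
                    stack.append(nb)
        regions.append(region)
    return regions


flat = [[0.0] * 6 for _ in range(4)]
ramp = [[0.3 * c for c in range(6)] for _ in range(4)]  # 45 degree slope
print(len(detect_obstacles(flat)), len(detect_obstacles(ramp)))  # 0 1
```

Each returned region could then be approximated by a polygon on the ground plane for use by a path planner.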

The  rationale  for  using  two  criteria,  elevation  and  surface 
normals,  is  that  the  surface  normals  are  meaningless  at  the 
edge  of  an  object,  and  the  elevation  cannot  be  used  alone 
without  some  knowledge  of  a  ground-plane  on  which  objects 
are  known  to  rest.  One  possible  undesirable  result  of  the 
detection  algorithm  is  that  a  small  portion  of  the  terrain  might 
be  reported  as  an  obstacle  due  to  the  resolution  of  the  bucket 
map.  This  error  can  be  corrected  only  when  a  more  complete 
object description is created. It does not, however,
significantly affect the behavior of the vehicle since only small
regions are involved.

For  the  purpose  of  obstacle  avoidance,  the  detected  objects 
are  represented  by  polygons  on  the  ground  surface  in  order  to 
be  used  by  a  path  planner.  Figure  3-1  shows  a  range  image, 
the  location  of  the  buckets  classified  as  parts  of  objects,  and 
the  polygonal  representation  of  the  obstacle  map.  The  squares 
indicate  the  buckets  in  which  the  objects  have  been  found. 
The  large  polygon  enclosing  the  map  is  the  boundary  of  the 
portion  of  the  environment  seen  by  the  sensor. 


Figure  3-1:  Obstacle  Detection 

4.  Surface  description 

The  obstacle  detection  algorithm  is  sufficient  for  vehicle 
navigation  in  a  simple  environment  which  includes  only  a 
smooth  terrain  and  discrete  obstacles.  A  typical  example  of 
such  an  environment  occurs  in  road  following  applications. 
We  need  a  more  sophisticated  representation  in  two  cases: 

•  The  surrounding  terrain  is  uneven.  In  that  case, 
part  of  the  environment  that  may  be  hazardous  or 
costly  to  navigate  cannot  be  described  as  discrete 
objects.

The strategy for expanding a region is to merge the best
point at the boundary at each step. This strategy guarantees a
near-optimal segmentation. It has, however, two major
drawbacks: it may be computationally expensive, and it
may lead to errors due to sensor errors on isolated points, such
as mixed points. To alleviate these problems, we use a
multi-resolution approach. We first apply the segmentation to a
reduced image in which each pixel corresponds to an n x n
window in the original image, n being the reduction factor.
This first, low-resolution, step produces a conservative
description of the image (Fig. 4-1.c). The low-resolution
regions are then expanded using the full-resolution image
(Fig. 4-1.d). No new regions are created at full resolution.
Figure 4-2 shows the segmentation of an image of uneven
terrain.

In addition to the pure region segmentation, we use the
edges extracted from the reflectance image to improve the
description. In the low-resolution segmentation step, pixels
that correspond to a window that contains at least one edge
pixel are removed. In the full-resolution step, regions are
expanded so that they do not cross an edge. As a result, edge
pixels are all part of the region boundaries. Explicitly
including edges improves the segmentation in two ways. First,
edges that correspond to low-amplitude occluding edges
separate regions that might otherwise be merged in the range
image segmentation. Second, reflectance edges can delineate
surface markings that are not visible in the range image. Figure
4-1.b shows an edge image obtained by applying a 10 x 10
Canny edge detector.

5.  Terrain  map  building 

Map building is the process of combining observations of
an unknown terrain into a coherent representation. Building a
terrain map serves two purposes: it allows a set of
observations to be reported as a product of an exploration
mission, and it improves the performance of an autonomous
vehicle when the vehicle traverses a previously mapped region.
Two types of information are kept in a terrain map: the
low-level measurements that are accumulated in a bucket map as
described in Section 2.3, and the terrain or object features.

The main issue in the map building process is the matching
and enhancement of a map built from measurements acquired
from different vantage points. This issue presents some
challenging aspects. Since we do not put any constraints on the
vehicle's trajectories, observations of the environment may be
radically different from one observation to the other. This
requires the identification of common features between
observations, as well as the use of uncertainty on the various
vehicle position estimates. The insertion of a new frame in
the map must therefore proceed in three steps. First, the
current estimate of the vehicle position is used to predict
matchings between the current map and the new observed
features. Second, a new position estimate is computed based
on the matchings and the current position estimate. Third, the
map is updated by inserting the new observations. This
involves the insertion of new measurements in the grid
representation, the insertion of the new features in the map,
and the updating of existing map features that have been
matched with newly extracted features.


(a)  Terrain  map  from  one  image 


Figure  5-1:  Terrain  map  building  (including  road  features) 

We have included the map building techniques in the
Carnegie-Mellon Navlab system. A terrain map was
maintained over a hundred meters while the vehicle was running
autonomously under control of the road following program.
The current vehicle position estimate was used as an initial
estimate for the matching. Due to memory limitations, only the
portion of the map within a window centered at the current
vehicle position was kept in the system. The features used for
the matching part in that implementation are: the primitives of
the surface description (as described in Section 4), the location
of discrete objects (as described in Section 3), and the location
of the road edges extracted from the reflectance channel. The
features are weighted according to their uncertainty, for example the
variance  Of/  in  the  case  of  planar  features.  The  road  edges  are 
a special type of feature: since road detection from
reflectance is currently not reliable, the edges are used only if the
contrast in the current reflectance image, e.g. the strength of
the detected road edges, is high enough.

6.  Object  recognition


6.1.  Recognition  strategy 

The goal of an object recognition algorithm is to find the
most consistent interpretation of a scene given a stored list of
primitives $(M_1, \ldots, M_n)$, the model, and a segmentation of
the observed scene, $(S_1, \ldots, S_m)$. The algorithm must
therefore search all the possible matchings
$((M_1, S_{j_1}), \ldots, (M_n, S_{j_n}))$. This search being a
combinatorial problem, the main issue is to prune the search space
in order to be able to process complex scenes. Many
strategies have been proposed for solving the object
recognition problem [6]. The most common approach is to use
geometric constraints to constrain the search, assuming the
objects are rigid. This approach may require an accurate
geometric model, which is usually not available in our
application. Another approach is to generate beforehand the possible
appearances of the object to be recognized in order to reduce the
search space by precompiling the constraints in the model [7,8].
We use a combination of both approaches in which geometric
constraints are precompiled in the model. The model has two
components: a set of surface patches, $M$, and a set of
constraints, $C$. The constraints encapsulate knowledge about the
object's shape, such as "surfaces $M_i$ and $M_j$ are orthogonal".
A constraint, $c$, associated with a set of regions
$(M_{i_1}, \ldots, M_{i_n})$ can be viewed as a function that
decides whether a partial matching
$((M_{i_1}, S_{j_1}), \ldots, (M_{i_n}, S_{j_n}))$ is acceptable.
The number $n$ of regions involved may be different depending on
the constraint. For example, the constraint on the area of a region
is a unary constraint, while the orthogonality constraint is a
binary constraint. The list of constraints and their
implementation are discussed in the next two sections.

The search algorithm first constructs a list of candidates for
each model region, $M_i$, by applying the unary constraints
associated with $M_i$ to every scene region, $S_j$. This provides a
first reduction of the search space according to unary constraints.
The algorithm then explores the remaining search
space, discarding the partial solutions that do not satisfy the
remaining constraints. In other words, each time a new pairing,
$(M_i, S_j)$, is added to a partial solution,
$((M_{i_1}, S_{j_1}), \ldots, (M_{i_n}, S_{j_n}))$, the constraints associated with $M_i$
are evaluated over the new set of pairings. The partial solution
is not explored further if one of the constraints is not satisfied.
The results of all the constraint evaluations are stored in tables,
so that a constraint is never evaluated twice on the same set of
pairings.
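
The search described above can be sketched as a depth-first enumeration with pruning. The following Python fragment is only an illustration under stated assumptions, not the implementation used in our system: the function name `search_matchings`, the dictionary layout of the regions, the restriction to unary and binary constraints, and the rule that a scene region is used at most once are all hypothetical choices made for the example.

```python
def search_matchings(model_regions, scene, unary, binary):
    """Enumerate complete matchings {model region: scene region id}.

    model_regions: ordered list of model region names
    scene:         dict scene id -> attribute dict
    unary:         dict model name -> list of predicates on one scene region
    binary:        dict (model a, model b) -> predicate on two scene regions
    (All of these containers are hypothetical, chosen for illustration.)
    """
    # Step 1: unary constraints give a list of candidates per model region.
    candidates = {
        m: [s for s in scene if all(f(scene[s]) for f in unary.get(m, []))]
        for m in model_regions
    }
    table = {}        # results of constraint evaluations, never recomputed
    solutions = []

    def holds(ma, sa, mb, sb):
        key = (ma, sa, mb, sb)
        if key not in table:
            table[key] = binary[(ma, mb)](scene[sa], scene[sb])
        return table[key]

    def extend(i, partial):
        if i == len(model_regions):
            solutions.append(dict(partial))
            return
        m = model_regions[i]
        for s in candidates[m]:
            if s in partial.values():      # assumption: one use per scene region
                continue
            # Evaluate the constraints that involve the newly paired region.
            if all(holds(ma, partial[ma], m, s)
                   for ma in partial if (ma, m) in binary):
                partial[m] = s
                extend(i + 1, partial)     # explore further only if consistent
                del partial[m]

    extend(0, {})
    return solutions, len(table)


def orthogonal(a, b):
    """Binary predicate: the two unit normals are roughly perpendicular."""
    dot = sum(x * y for x, y in zip(a["normal"], b["normal"]))
    return abs(dot) < 0.1


scene = {
    "s1": {"area": 4.0, "normal": (0.0, 0.0, 1.0)},
    "s2": {"area": 2.0, "normal": (1.0, 0.0, 0.0)},
    "s3": {"area": 2.0, "normal": (0.0, 0.0, 1.0)},
}
unary = {
    "roof": [lambda r: r["area"] >= 3.0],
    "side": [lambda r: r["area"] >= 1.0],
}
binary = {("roof", "side"): orthogonal}
solutions, n_evaluations = search_matchings(["roof", "side"], scene, unary, binary)
```

In this toy scene the unary AREA-like test leaves a single candidate for the roof, and the orthogonality test then rejects the pairing of the side with the second horizontal region, so only one matching survives.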

The  result  of  the  search  algorithm  is  a  small  set  of  solu¬ 
tions.  The  last  step  of  the  recognition  algorithm  is  to  compute 
a  score  that  reflects  the  quality  of  each  solution.  This  last  step 
is  necessary  since  there  is  no  way  of  forcing  the  search  to 
produce  only  one  solution  because  of  near-symmetries  in  the 
model,  segmentation  errors,  or  even  the  presence  of  several 
instances  of  the  object  in  the  scene.  The  actual  computation 
of the score is discussed in detail in Section 6.3. The next
sections describe the details of the algorithm. We use a car as
an example object model in discussing the algorithm.


6.2.  Constraints 

The  constraints  we  currently  use  are: 

•  NX,  NY,  NZ:  constrains  the  components  of  the 
surface  normal  of  a  region.  This  constraint  is 
used  to  implement  natural  limitations  on  the 
orientation  of  an  object,  such  as  "the  roof  of  a  car 
cannot  be  vertical". 

•  Z:  constrains  the  vertical  position  of  a  region. 

•  AREA: constrains the area of one region.

•  ANGLE: constrains the angle between two
regions.

•  NEIGHBOR: constrains two regions to be neighbors
by computing the distance between the
boundaries of two regions.

•  EXCLUDE: forbids two regions to be visible at
the same time. This constraint is based on the
notion of aspects [7]. An aspect is a set of regions
that can be observed from a given viewpoint.
Instead of explicitly enumerating the possible
aspects of an object, the EXCLUDE constraint
describes them implicitly.

•  DISTANCE:  constrains  the  distance  between  two 
surfaces. 

The constraints are precomputed and stored in the model.
Each constraint is described by the following structure:

•  number of arguments, N: For example, the
constraint ANGLE, which constrains the angle between
two regions, has two arguments. The maximum
number of arguments is currently three.

•  evaluation function, F: A function that returns
an interval given N regions. For example, the
constraint ANGLE computes an interval centered
around the angle between two input regions. The
size of the interval is determined at run-time by
the uncertainty on the parameters of the regions.
In the case of ANGLE, the interval width is given
by the angular variance $\sigma_a$ within the two input
regions.

•  interval, I: An interval, or set of intervals, which
must intersect the computed interval to satisfy the
constraint.

This representation of constraints is flexible: a new type of
constraint can be added to a model simply by defining
the appropriate evaluation function. Building a new model is
easier since the constraints are not hardcoded in the recognition
program.
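
To make the (N, F, I) structure concrete, here is a small Python sketch of a constraint record and of an ANGLE evaluation function. It is illustrative only: the class name `Constraint`, the field names, the region dictionaries, and in particular the way the two angular uncertainties are combined into a half-width (a simple sum) are assumptions, since the text does not specify them.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Interval = Tuple[float, float]


@dataclass
class Constraint:
    name: str
    n_args: int                         # N: number of regions involved
    evaluate: Callable[..., Interval]   # F: N regions -> computed interval
    allowed: List[Interval]             # I: interval(s) stored in the model

    def satisfied(self, *regions) -> bool:
        if len(regions) != self.n_args:
            raise ValueError(f"{self.name} expects {self.n_args} regions")
        lo, hi = self.evaluate(*regions)
        # Satisfied when the computed interval intersects an allowed one.
        return any(lo <= b and a <= hi for a, b in self.allowed)


def angle_interval(r1, r2) -> Interval:
    """Interval (degrees) centered on the angle between two unit normals."""
    dot = sum(a * b for a, b in zip(r1["normal"], r2["normal"]))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, dot))))
    # Assumption: the half-width is the sum of the two angular uncertainties.
    half_width = r1["sigma_a"] + r2["sigma_a"]
    return (angle - half_width, angle + half_width)


ORTHOGONAL = Constraint("ANGLE", 2, angle_interval, [(85.0, 95.0)])

roof = {"normal": (0.0, 0.0, 1.0), "sigma_a": 2.0}
side = {"normal": (1.0, 0.0, 0.0), "sigma_a": 3.0}
hood = {"normal": (0.0, 0.5, math.sqrt(0.75)), "sigma_a": 2.0}
```

Adding a new constraint type then amounts to writing another evaluation function with the same signature; the intersection test itself does not change.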

•  The mission requires the recognition of specific
objects given a priori models. In that case, the
mere knowledge of the existence and position of
an object in the world is not sufficient; we need a
more detailed description of its shape.

We describe surfaces by a set of connected surface patches.
Each patch corresponds to a smooth portion of the surface and
is approximated by a parameterized surface. In addition to
the parameters and the neighbors, each region has two uncertainty
factors: $\sigma_a$ and $\sigma_d$. $\sigma_a$ is the variance of the angle
between the measured surface normal and the surface normal
of the approximating surface at each point. $\sigma_d$ is the variance
of the distance between the measured points and the approximating
surface. Those two attributes are used in the
object recognition algorithm.

The surface description is obtained by segmenting the
range image into regions. Several schemes for range image
segmentation have been proposed in previous work [6]. These
techniques are based either on clustering in some parameter
space, or on region growing using smoothness criteria of the
surface. We chose to combine both approaches into a single
segmentation algorithm. The algorithm first attempts to find
groups of points that belong to the same surface, and then
uses these groups as seeds for region growing, so that each
group is expanded into a smooth connected surface patch.
The smoothness of a patch is evaluated by fitting a surface,
plane or quadric, in the least-squares sense.

Figure 4-1: Range image segmentation. Panels: (a) Range image;
(b) Edges from reflectance image; (c) Low-resolution segmentation
(n = 2); (d) Final segmentation.

Figure 4-2: Range image segmentation (same four panels).

6.3. Evaluation of the solutions

One would like to have a recognition program that
generates only one solution that is reported as the recognized
object in the scene. Unfortunately, the search algorithm
generates many solutions that have to be evaluated in order to
determine the best one. There are three reasons why the
search generates multiple solutions. First, the constraints we
use are very liberal so that the same model can be used for a
wide range of scenes; consequently, false solutions are difficult
to avoid. Second, the object may have near-symmetries
that lead to several equally valid interpretations. Third, the
image segmentation being imperfect, an object region may be
broken into several pieces in the image, thus producing
several equivalent solutions.

Our  approach  to  evaluating  solutions  is  to  first  compute  the 
position  and  orientation,  or  pose,  of  the  object  for  each  solu¬ 
tion,  to  then  generate  a  synthesized,  or  predicted,  range  image 
using  the  estimated  pose,  and  to  finally  correlate  the  predicted 
image  and  the  original  range  image.  We  derive  a  score  from 
the  correlation  measure  between  the  two  images  which  is  used 
to  discard  erroneous  solutions,  and  to  sort  the  other  solutions. 

The pose is calculated for each solution by minimizing the

where $R$ and $t$ are the estimated rotation and translation,
$n_i^{model}$ and $c_i^{model}$ (resp. $n_j^{scene}$ and $c_j^{scene}$) are the surface
normal and center of region $i$ (resp. $j$) of the model (resp.
scene). The summation is over all the pairs (scene region $j$,
model region $i$).

In order to compute the predicted image, we have to compute
the position of each point of the model in image coordinates,
as well as the predicted range. The position in the
predicted range image of a point $p$ on the surface of the
model is given by:

$$\phi^{predicted} = \operatorname{atan}(x / z)$$

$$\theta^{predicted} = \operatorname{atan}(y \times \cos(\phi) / x)$$

$$row = (\phi_0 - \phi^{predicted}) / \Delta\phi$$

$$col = (\theta^{predicted} - \theta_0) / \Delta\theta$$

$$d^{predicted} = \sqrt{x \times x + y \times y + z \times z}$$

In these equations, $x, y, z$ are the coordinates of the transformed
point $R p + t$, $(row, col)$ is the predicted location in
the image, and $d^{predicted}$ is the predicted range value at that
location.

The correlation between predicted and actual range images
is given by:

$$C = \sum_{(i,j) \in P \cap O} a_{ij} \times \left| d_{ij}^{predicted} - d_{ij}^{original} \right|$$

where $a_{ij}$ is the area intercepted by pixel $(i,j)$, and the summation
is made over the intersection $P \cap O$ of the object in the
predicted image, $P$, and the object in the observed range
image, $O$.

The correlation must be normalized to obtain a score $S$ that
is between 0 and 1:

$$S = \frac{area(P \cap O)}{area(P \cup O)} \times \left(1 - \frac{C}{K \times area(P \cap O)}\right)$$

In this equation, $K$ is the maximum range difference allowed
between predicted and observed images (independent
of the image). $S$ is normalized by $area(P \cap O) / area(P \cup O)$
to avoid problems when $P \cap O$ is very small, in which case
we would give a high score to a very poor solution. We use
the score $S$ to eliminate false solutions (typically $S < 0.5$), and
to sort the solutions by decreasing score. Figure 6-1 shows
the solution of highest score found on one image. The top
image shows the superimposition of the recognized model and
the range image; the bottom image is the overhead view of the
superimposition.
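
As a worked illustration of the score, the sketch below computes C and S for two small sets of object pixels. It is a hedged example rather than our actual code: the function name `solution_score`, the dictionary representation of the images, the unit pixel area, and the clipping of range differences at K (added so that S cannot become negative) are all assumptions.

```python
def solution_score(predicted, observed, K, pixel_area=lambda pixel: 1.0):
    """Score a solution from predicted and observed range values.

    predicted, observed: dict (row, col) -> range, object pixels only.
    K: maximum range difference allowed between the two images.
    pixel_area: area intercepted by a pixel (assumed constant by default).
    """
    inter = predicted.keys() & observed.keys()     # P intersect O
    union = predicted.keys() | observed.keys()     # P union O
    if not inter:
        return 0.0
    # C: area-weighted sum of absolute range differences over the overlap.
    # Assumption: differences are clipped at K so that S stays in [0, 1].
    C = sum(pixel_area(p) * min(abs(predicted[p] - observed[p]), K)
            for p in inter)
    area_inter = sum(pixel_area(p) for p in inter)
    area_union = sum(pixel_area(p) for p in union)
    return (area_inter / area_union) * (1.0 - C / (K * area_inter))


predicted = {(0, 0): 10.0, (0, 1): 10.0, (1, 0): 10.0}
observed = {(0, 0): 10.0, (0, 1): 11.0, (1, 1): 10.0}
S = solution_score(predicted, observed, K=2.0)
```

Here two of the four pixels in the union overlap and the summed range difference is 1, so S = (2/4) x (1 - 1/(2 x 2)) = 0.375, which would fall below the typical 0.5 rejection threshold.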


Figure 6-1: Object recognition (top: recognized model superimposed
on the range image; bottom: overhead view)

6.4.  Results 

We have tested the object recognition program on 23
images of two different objects. Each image corresponds to a
different aspect of the objects, or to a different distance between
the object and the sensor. Figure 6-2 shows a sample
of the set of images. The failure modes of the program are as
follows:

•  In two cases, the program failed to produce any
interpretation. These cases occur when not
enough regions have been extracted to perform
the recognition.

•  In two cases, the program produced a set of interpretations,
but the correct solution was not in the
top-score set of solutions.

•  In three cases, the program produced the correct
interpretation as part of the top-score set of solutions
but not as the best solution. These cases
occur when a near symmetry in the image cannot
be disambiguated based on the shape information
only; as a result, both the correct solution and its
symmetric counterpart have the same score.


Figure  6-2:  Object  recognition  on  a  sample  set  of  images 


The  object  recognition  algorithm  can  be  improved  in 
several  ways:  The  recognition  of  one  object  requires  an 
average  of  1,500  constraint  evaluations,  most  of  which  are 
rarely  used  for  rejecting  a  partial  solution.  One  optimization  is 
to  order  the  constraints  in  the  model  so  that  the  constraints 


7.  Conclusion 

We  have  presented  a  set  of  techniques  for  producing  3-D 
environment  descriptions  for  an  autonomous  vehicle.  Those 
techniques  and  their  output  descriptions  have  been  designed 
to  fulfill  the  requirements  of  the  major  tasks  of  an 
autonomous  vehicle  such  as  obstacle  avoidance,  landmark 
recognition, and map building. We have conducted
numerous demonstrations on the CMU autonomous vehicle system,
the Navlab, for navigation and map building applications.

We are currently pursuing further research along three
lines: implementation on faster hardware, handling uncertainty,
and developing a more general scheme for object
recognition.

The  primary  bottleneck  in  the  development  of  an 
autonomous  vehicle  is  the  computation  time  required  by  the 
sensory  modules  such  as  the  range  image  analysis  module. 
We are in the process of porting the algorithms to a fast
systolic machine, the Warp [10]. The obstacle detection module
has already been successfully demonstrated with a cycle time on
the order of one second on the Warp. The second line of
work is the representation of uncertainty. Sensor measurements,
vehicle position estimates, and the segmentation process
all induce errors in the final 3-D description. These errors can be
quantified and taken into account in all the range analysis
algorithms. We plan to use an explicit representation of the
uncertainty in building surface descriptions [11, 12]. Third, we plan
to develop a more general scheme for object recognition.
In particular, we plan to apply the object recognition algorithm to a
larger class of objects and algorithms.

ABSTRACT

One of the most important obstacles standing in the way of widespread use
of parallel computers for low-level vision is the lack of a programming
language that can be mapped efficiently onto different computer architectures
and that is suited for low-level vision. We present such a language, and
demonstrate its capabilities by comparing the performance of the Hughes
HBA, the Carnegie Mellon Warp machine, and the Sun 3 on a large set of
low-level vision programs. In order to demonstrate its efficiency, we also
compare performance of the Sun code with routines of similar function
written by professional programmers.


1.  Introduction 

Low-level  vision  is  an  area  of  computer  science  that  is  ripe  for  the  use  of 
parallel  computers.  This  class  of  operations  is  easily  parallelizable.  Indeed, 
many  parallel  computers  are  already  being  developed  for  use  at  this  level  of 
vision.  These  computers  offer  enormous  speedup  to  the  developer  of 
computer  vision  algorithms,  since  these  operations  are  so  time-consuming, 
but  software  development  is  necessary  before  they  can  be  used. 

In previous work, Hamey, Webb, and Wu developed a language called
Apply [2] which can generate efficient programs for a variety of parallel
machines given a single source code. Apply therefore allows machine-
independent programming for a limited, application-specific set of
algorithms.

Apply  has  been  used  to  develop  a  library  of  vision  programs  called  WEB, 
which  includes  routines  for  many  low-level  vision  operations.  Over  130 
programs  exist  in  WEB,  80%  of  which  are  written  in  Apply.  The  Apply 
routines  include  basic  image  operations,  convolution,  edge  detection, 
smoothing,  binary  image  processing,  color  conversion,  pattern  generation, 
and  multi-level  image  processing.  This  library  is  therefore  a  machine- 
independent  software  base  for  low-level  image  processing. 

Because of the machine independence of the Apply language, programs
written in Apply can be ported from one machine to another simply by
recompilation. Moreover, the Apply compiler and the WEB library allow the
comparison of the performance of vision machines, since the same source
code will be running on both machines, which is the strongest possible basis
for comparison of two computers.

In  this  paper,  we  demonstrate  this  by  studying  the  performance  of  Apply 
on  three  diverse  architectures,  by  examining  the  execution  times  of  programs 
from WEB. The architectures are the Carnegie Mellon Warp machine, a 100
MFLOPS systolic array machine [1]; a Sun 3/75 workstation; and the Hughes
Aircraft Corporation Hierarchical Bus Architecture (HBA) [6], a MIMD
computer  specifically  designed  for  image  processing  applications.  These 
architectures  differ  in  the  number  of  processors,  in  the  processor  topology 
and  in  the  underlying  processor,  but  Apply  generates  efficient  code  for  all  of 
them. 

We establish a baseline of Apply's performance by comparing Apply code
with code generated by hand for some of the computers. Then we use
execution time as a basis for evaluating the performance of Apply, and for
studying the suitability of these machines as image processors.

Constant parameters may be scalars, vectors or two-dimensional arrays.
They represent precomputed constants which are made available for use by
the procedure. For example, a convolution program would use a constant
array for the convolution mask.

2.  The  Apply  Language 

Apply  implements  a  simple  data  partitioning  model  for  image  processing 
that  makes  explicit  the  parallelism  of  low-level  vision  algorithms.  When 
using  Apply,  the  programmer  writes  a  procedure  which  defines  the  operation 
to  be  applied  at  a  particular  pixel  location.  The  procedure  conforms  to  the 
following programming model:

•  It  accepts  a  window  or  a  pixel  from  each  input  image. 

•  It  performs  an  arbitrary  computation  without  side-effects. 

•  It  returns  a  pixel  value  for  each  output  image. 

The  idea  of  the  Apply  programming  model  grew  out  of  a  desire  for 
efficiency  combined  with  ease  of  programming  for  a  useful  class  of  low-level 
vision operations. After implementing Apply, the following additional
advantages became evident:

•  Apply  concentrates  programming  effort  on  the  actual 
computation  to  be  performed  instead  of  the  looping  in  which  it  is 
embedded.  This  encourages  programmers  to  use  more  efficient 
implementations  of  their  algorithms.  For  example,  a  Sobel 
program  gained  a  factor  of  four  in  speed  when  it  was 
reimplemented  with  Apply.  This  speedup  primarily  resulted 
from  explicitly  coding  the  convolutions.  The  resulting  code  is 
more  comprehensible  than  the  earlier  implementation. 

•  Apply programs are easier to write, easier to debug, more
comprehensible and more likely to work correctly the first time.

A  major  benefit  of  Apply  is  that  it  greatly  reduces  programming 
time  and  effort  for  a  very  useful  class  of  vision  algorithms.  The 
resulting  programs  are  also  faster  than  the  programmer  would 
probably  otherwise  achieve. 


2.1. Details of the language

Apply  is  designed  for  programming  image  to  image  computations  where 
the  pixels  of  the  output  images  can  be  computed  from  corresponding 
rectangular  windows  of  the  input  images.  The  essential  feature  of  the 
language  is  that  each  operation  is  written  as  a  procedure  for  a  single  pixel 
position.  The  Apply  compiler  generates  a  program  which  executes  the 
procedure  over  an  entire  image.  No  ordering  constraints  are  provided  for  in 
the  language,  allowing  the  compiler  complete  freedom  in  dividing  the 
computation  among  processors. 

Each  procedure  has  a  parameter  list  containing  parameters  of  any  of  the 
following  types:  in,  out  or  constant.  Input  parameters  are  either  scalar 
variables  or  two-dimensional  arrays.  A  scalar  input  variable  represents  the 
pixel  value  of  an  input  image  at  the  current  processing  co-ordinates.  A 
two-dimensional  array  input  variable  represents  a  window  of  an  input  image. 
Element (0,0) of the array corresponds to the current processing co-ordinates.

Each output variable represents the pixel value of an output image. The
final value of an output variable is stored in the output image at the current
processing co-ordinates.

2.2. An Implementation of Sobel Edge Detection

As  a  simple  example  of  the  use  of  Apply,  let  us  consider  the 
implementation  of  Sobel  edge  detection.  Sobel  edge  detection  is  performed 
by  convolving  the  input  image  with  two  3x3  masks.  The  horizontal  mask 
measures  the  gradient  of  horizontal  edges,  and  the  vertical  mask  measures  the 
gradient  of  vertical  edges.  Diagonal  edges  produce  some  response  from  each 
mask,  allowing  the  edge  orientation  and  strength  to  be  measured  for  all 
edges. Both masks are shown in Figure 1.

    |  1   2   1 |        |  1   0  -1 |
    |  0   0   0 |        |  2   0  -2 |
    | -1  -2  -1 |        |  1   0  -1 |

     Horizontal             Vertical

Figure 1: The Sobel Convolution Masks.

An Apply implementation of Sobel edge detection is shown in Figure 2.

    procedure sobel (inimg  : in array (-1..1, -1..1) of byte
                              border 0,
                     thresh : const real,
                     mag    : out real)
    is
        horiz, vert : integer;
    begin
        horiz := inimg(-1,-1) + 2 * inimg(-1,0) + inimg(-1,1)
               - inimg(1,-1) - 2 * inimg(1,0) - inimg(1,1);
        vert  := inimg(-1,-1) + 2 * inimg(0,-1) + inimg(1,-1)
               - inimg(-1,1) - 2 * inimg(0,1) - inimg(1,1);
        mag := sqrt(REAL(horiz)*REAL(horiz)
                    + REAL(vert)*REAL(vert));
        if mag < thresh then
            mag := 0.0;
        end if;
    end sobel;

Figure 2: An Apply Implementation of Thresholded Sobel Edge Detection.

First the input, output and constant parameters to the function are defined.
The input parameter inimg is a window of the input image. The constant
parameter thresh is a threshold. Edges which are weaker than this threshold
are suppressed in the output magnitude image, mag. Horiz and vert
are internal variables used to hold the results of the horizontal and vertical
Sobel edge operators.

The input image window is also defined in the declaration of inimg. It is a
3x3 window centered about the current pixel processing position, which is
filled with the value 0 when the window lies outside the image. The same
parameter list declares the constant and output parameters to be
floating-point scalar variables.

The  computation  of  the  Sobel  convolutions  is  implemented  by  the  straight¬ 
forward  expressions  in  the  body  of  the  function.  These  expressions  are 
readily  seen  to  be  a  direct  implementation  of  the  convolutions  in  Figure  1 . 
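
For readers without access to an Apply compiler, the programming model can be imitated in a few lines of Python. This is only a sketch of the idea (a per-pixel procedure, with no side effects, applied over the whole image with a border value of 0); the helper names `apply_procedure` and `make_sobel` are hypothetical, and the sequential loop stands in for whatever partitioning a real Apply compiler would choose.

```python
import math


def apply_procedure(proc, image, border=0):
    """Run a per-pixel procedure over every position of a 2-D image.

    proc receives a window accessor inimg(dr, dc), relative to the current
    processing co-ordinates, and returns the output pixel value.
    """
    rows, cols = len(image), len(image[0])

    def window_at(r, c):
        def inimg(dr, dc):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                return image[rr][cc]
            return border                  # outside the image: border value
        return inimg

    # No ordering is implied: each output pixel depends only on its window.
    return [[proc(window_at(r, c)) for c in range(cols)] for r in range(rows)]


def make_sobel(thresh):
    """Build the thresholded Sobel procedure of Figure 2."""
    def sobel(inimg):
        horiz = (inimg(-1, -1) + 2 * inimg(-1, 0) + inimg(-1, 1)
                 - inimg(1, -1) - 2 * inimg(1, 0) - inimg(1, 1))
        vert = (inimg(-1, -1) + 2 * inimg(0, -1) + inimg(1, -1)
                - inimg(-1, 1) - 2 * inimg(0, 1) - inimg(1, 1))
        mag = math.sqrt(float(horiz) * float(horiz) + float(vert) * float(vert))
        return 0.0 if mag < thresh else mag
    return sobel


# A 4x4 test image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 10, 10] for _ in range(4)]
edges = apply_procedure(make_sobel(10.0), image)
```

Because the procedure never refers to loop indices or to neighbouring outputs, the same definition could be evaluated in any order or split among processors, which is the property the Apply compiler exploits.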


3.  The  WEB  Library 

Apply has been used to implement a large portion of WEB, a large library
of vision programs implemented for use on the Carnegie Mellon Warp
machine. The original purpose of the
library was to facilitate vision programming on the Warp machine.

WEB  currently  consists  of  over  130  routines,  80%  of  which  are  written  in 
Apply.  The  rest  are  written  in  W2,  which  is  the  standard  Warp  programming 
language.  All  of  the  local  image-to-image  vision  routines  in  WEB  are 
written  in  Apply;  the  W2  routines  include  non-local  routines  such  as 
histogram,  image  warping,  and  connected  components. 


WEB is based on the SPIDER library of FORTRAN programs [5]. This is
a subroutine library, developed in Japan, for image processing using
FORTRAN. Routines from SPIDER will be compared here in performance
with equivalent routines from WEB in order to measure Apply's performance
as a code generator for the Sun.


4.  Implementations  of  Apply 

Apply  has  been  implemented,  so  far,  on  three  architectures:  the  Carnegie 
Mellon  Warp  machine,  the  Sun  3/75  workstation,  and  the  Hughes  HBA.  In 
all  three  cases,  Apply  is  being  used  regularly  for  programming  low-level 
vision  operations  and  has  been  a  useful  tool  in  making  these  operations  easier 
to  write.  We  describe  the  implementations  in  order  to  give  examples  of  how 
Apply  can  be  implemented,  and  to  show  how  the  Apply  implementor  can 
make use of special features of the architectures to make Apply more
efficient.


4.1.  Apply  on  Warp 

We  implement  Apply  on  Warp  by  the  input  partitioning  method.  On  a 
Warp  array  of  ten  cells,  the  image  is  divided  into  ten  regions,  by  column,  as 
shown  in  Figure  3.  This  gives  each  cell  a  tall,  narrow  region  to  process;  for 
512x512  image  processing,  the  region  size  is  52  columns  by  512  rows.  To 
use technical terms from weaving, the Warp cells are the “warp” of the
processing; the “weft” is the rows of the image as it passes through the
Warp array.

The image is divided in this way using a series of macros called GETROW,
PUTROW, and COMPUTEROW. GETROW generates code that takes a row of an
image from the external host and distributes one-tenth of it to each of ten
cells. The programmer includes a GETROW macro at the point in his program
where he wants to obtain a row of the image; after the execution of the
macro, a buffer in the internal cell memory has the data from the image row.

[Diagram: a 512 x 512 image divided by column into ten regions, labeled
from left to right cell 9 through cell 0.]

Figure 3: Input Partitioning Method on Warp

The GETROW macro works as follows. The external host sends in the
image rows as a packed array of bytes; for a 512-byte wide image, this array
consists of 128 32-bit words. These words are unpacked and converted to
floating point numbers in the interface unit. The 512 32-bit floating point
numbers resulting from this operation are fed in sequence to the first cell of
the Warp array. This cell takes one-tenth of the numbers, removing them
from the stream, and passes the rest through to the next cell. The first cell
then adds a number of zeroes to replace the data it has removed, so that the
number of data received and sent are equal.

This  process  is  repeated  in  each  cell.  In  this  way,  each  cell  obtains  one- 
tenth  of  the  data  from  a  row  of  the  image.  As  the  program  is  executed,  and 
the  process  is  repeated  for  all  rows  of  the  image,  each  cell  sees  an  adjacent 
set  of  columns  of  the  image,  as  shown  in  Figure  3. 

We have omitted certain details of GETROW; for example, usually the
image row size is not an exact multiple of ten. In this case, the GETROW
macro pads the row equally on both sides by having the interface unit
generate an appropriate number of zeroes on either side of the image row.
Also, usually the area of the image each cell must see to generate its outputs
overlaps with the next cell's area. In this case, the cell copies some of the
data it receives to the next cell. All this code is automatically generated by
GETROW.

PUTROW, the corresponding macro for output, takes a buffer of one-tenth
of the row length from each cell and combines them by concatenation. The
output row starts as a buffer of 512 zeroes generated by the interface unit.
The first cell discards the first one-tenth of these and adds its own data to the
end. The second cell does the same, adding its data after the first. When the
buffer leaves the last cell, all the zeroes have been discarded and the first
cell's data has reached the beginning of the buffer. The interface unit then
converts the floating point numbers in the buffer to bytes and outputs it to
the external host, which receives an array of 512 bytes packed into 128 32-bit
words. As with GETROW, PUTROW handles image buffers that are not
multiples of ten, this time by discarding data on both sides of the buffer
before the buffer is sent to the interface unit by the last cell.
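
The data movement performed by GETROW and PUTROW can be simulated sequentially to check that each cell ends up with the expected tenth of the row. The Python sketch below is an illustration only: the function names `getrow` and `putrow` mirror the macros but are hypothetical, the row length is assumed to be an exact multiple of the number of cells, and the padding, overlap, and byte conversion steps are left out.

```python
def getrow(row, n_cells):
    """Distribute one image row among n_cells cells, as GETROW does.

    Assumption: len(row) is an exact multiple of n_cells (no padding).
    """
    chunk = len(row) // n_cells
    stream = list(row)                 # data fed in sequence to the first cell
    buffers = []
    for _ in range(n_cells):
        buffers.append(stream[:chunk])             # cell keeps the leading part
        stream = stream[chunk:] + [0.0] * chunk    # pass the rest, add zeroes
    return buffers


def putrow(buffers):
    """Concatenate the per-cell output buffers, as PUTROW does."""
    chunk = len(buffers[0])
    stream = [0.0] * (chunk * len(buffers))  # zeroes from the interface unit
    for buf in buffers:
        # Each cell discards the leading part and adds its data to the end.
        stream = stream[chunk:] + list(buf)
    return stream


row = [float(v) for v in range(20)]
buffers = getrow(row, 4)
rebuilt = putrow(buffers)
```

With four cells and a 20-element row, the first cell holds elements 0 to 4 and the last holds 15 to 19; passing the buffers back through `putrow` reproduces the original row, with every zero discarded along the way.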

During GETROW, no computation is performed; the same applies to
PUTROW. Warp's horizontal microword, however, allows input,
computation, and output at the same time. COMPUTEROW implements this.
Ignoring the complications mentioned above, COMPUTEROW consists of three
loops. In the first loop, the data for the cell is read into a memory buffer
from the previous cell, as in GETROW, and at the same time the first one-tenth
of the output buffer is discarded, as in PUTROW. In the second loop,
nine-tenths of the input row is passed through to the next cell, as in GETROW; at
the same time, nine-tenths of the output buffer is passed through, as in
PUTROW. This loop is unwound by COMPUTEROW so that for every 9 inputs
and outputs passed through, one output of this cell is computed. In the third
loop, the outputs computed in the second loop are passed on to the next cell,
as in PUTROW.


4.2.  Apply  on  the  Sun 

The Sun 3/75 workstation is a conventional serial computer with an
MC68020 microprocessor, an MC68881 coprocessor, and a 16MB main
memory. In image processing, all images are stored in the main memory.

The Sun C compiler does not do sophisticated analysis of index
expressions to simplify access to arrays. The Apply C compiler employs a
special technique called cyclic-scroll buffering here, which uses little
space and time to buffer the rows of the image. The technique allows
the kernel to be shifted and scrolled over the buffer with much simpler and
more efficient index expressions than would be generated if the image were
stored in a large array.

The cyclic-scroll buffering technique, which we developed for Apply on
uniprocessor machines, is described as follows. For an NxN input image
which will be processed with an MxM kernel, a buffer with
(N+M-1)xM + (N-1) elements is required.

Figure 4: Processing the first row by the cyclic-scroll buffering


Figures 4 and 5 display the column-major arrangement for processing a
3x3  kernel.  The  pointers  represent  successive  positions  in  memory.  In 
addition,  we  keep  two  base  pointers  for  the  buffer.  One,  called  row  base, 
points  to  the  first  pixel  of  the  three  rows  of  the  image  and  the  other,  called 
kernel  base,  points  to  the  first  pixel  of  the  kernel.  C  language  subscripting 
can  be  used  to  directly  access  the  elements  of  the  kernel  except  that  the 
indices  of  row  and  column  must  be  exchanged  because  the  rows  of  the 
images  are  stored  in  column-major  order. 

Initially,  we  put  the  first  M  rows  of  the  image,  including  the  border,  into 
the  buffer  in  column-major  order.  When  the  first  kernel  is  processed,  row 
base  points  to  the  first  element  of  the  buffer,  and  kernel  base  points  to  the 
first  element  of  the  window  to  be  processed.  After  the  first  kernel  has  been 
processed,  the  kernel  base  is  incremented  by  M  to  point  to  the  first  pixel  of 
the  next  kernel.  It  is  thus  possible  to  shift  the  kernel  across  the  entire  buffer 
of  data  with  a  cost  of  only  one  addition. 

When an entire row has been processed, the first row in the buffer
(counting from the row base) is discarded and the next row of the image is
written into the discarded row with a column displacement of one (i.e.,
beginning at the second element). Then the row base is incremented by one.
The column displacement of one ensures that the newly input row becomes the
last row of the buffer as seen from the new row base. Effectively, the
scrolling is done at the same time as the input. After the kernel base is
reset to point to the center element of the new window, we can do another
row operation in the same way as the first, until all the rows are
processed. Figures 4 and 5 show the processing of the first and second rows.
KB: The element pointed to by kernel base.
RB: The element pointed to by row base.


Figure  5:  Processing  the  second  row  by  the  cyclic-scroll  buffering 

For each row operation, one more memory element is needed in the buffer.
Therefore, the total number of elements in the buffer is
Mx(N+M-1) + (N-1).
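
The buffering scheme described above can be sketched in a few lines of Python. This is only an illustrative model, not the code Apply generates: the function name `cyclic_scroll_apply`, the zero border, the odd kernel size, and the choice to let the kernel base point at the first (top-left) element of the window are all assumptions made for the sketch. It shows the two properties the text relies on: shifting the kernel costs one addition, and scrolling to the next row costs one extra buffer element.

```python
def cyclic_scroll_apply(image, m, kernel_fn):
    """Apply kernel_fn to every m x m window of a square image.

    Hypothetical helper: assumes m is odd and pads the image with a
    zero border of m // 2 pixels on every side.
    """
    n = len(image)
    half = m // 2
    width = n + m - 1                      # row length including the border
    padded = [[0] * width for _ in range(width)]
    for r in range(n):
        for c in range(n):
            padded[r + half][c + half] = image[r][c]

    # M rows in column-major order, plus one spare element per scroll.
    buf = [0] * (m * width + (n - 1))
    for r in range(m):
        for c in range(width):
            buf[c * m + r] = padded[r][c]

    out = [[0] * n for _ in range(n)]
    row_base = 0
    for i in range(n):
        kernel_base = row_base             # first element of the first window
        for j in range(n):
            # Window element (r, c) lives at kernel_base + c*m + r.
            window = [[buf[kernel_base + c * m + r] for c in range(m)]
                      for r in range(m)]
            out[i][j] = kernel_fn(window)
            kernel_base += m               # shift one column: one addition
        if i + 1 < n:
            # Overwrite the discarded top row, displaced by one column,
            # then advance the row base by one element.
            next_row = padded[i + m]
            for c in range(width):
                buf[row_base + (c + 1) * m] = next_row[c]
            row_base += 1
    return out
```

With the row base advanced by one, the freshly written row is addressed as the last of the M buffered rows, so the window indexing never changes between row operations.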

4.3.  Apply  on  the  Hughes  HBA 

The Hughes HBA is a 24 processor MIMD computer specifically designed
for image processing applications. The main architectural novelty of the
HBA is its video I/O bus, a digital bus that can broadcast copies of a
digitized image to all 24 processors from an external buffer in one frame
time. Each processor (called an IP, for image processor) supports vision
programming with a general purpose CPU (MC68020 with a 12 MHz clock rate at
the time of this study), floating-point coprocessor (MC68881) and ample
memory (1 MB). The IPs are also linked by a low-bandwidth (80 MHz)
communications bus intended for message passing among the IPs and the host
computer. Software support for all levels of image processing includes a
high-level (C) language compiler, debugger, and an operating system (HOPS)
supporting video data transfers, message passing, memory allocation, tracing
and downloading software from the host. In support of low-level pixel
processing operations the HBA runs the Apply routine.

The HBA implementation of Apply uses the video I/O bus to transfer
image data from a frame buffer to the IPs. Although it is possible to
transfer the entire image to each of the IPs, it is only necessary for each
IP to look at a part of the image to achieve linear speedup of the Apply
operation. An IP receives data from the video bus by specifying a starting
scanline and a number of scanlines to be received. The data on the video bus
appears in scan-sequential order, so the IP opens a DMA pathway to its local
memory when its starting scanline appears on the bus. An IP outputs video
data in much the same way, by setting the appropriate control registers with
the start scanline, number of scanlines to output, and location of the video
data in its local memory. The set of consecutive scanlines an IP receives or
transmits is called a swath.

For the Apply operation, an input swath is always at least as large as an
output swath. If the input image is of size R rows by C columns, and the
programmer wants to compute an output image with an Apply operator using
N processors, each IP computes a swath of S = floor(R/N) rows by C columns.
If p is a processor number in the range 0..N-1, then each output swath
extends from row Sp to row S(p+1)-1. If the Apply operation is defined over
a window that extends from [lbr,lbc] to [ubr,ubc], then each processor p must
input a swath that extends from Start = Sp+lbr to End = S(p+1)+ubr-1. If
Start<0 the input routine inserts -Start rows of 0's in place of the video data
and advances Start to 0. The case of End>R-1 is similar.
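
As a worked illustration of the swath arithmetic above, the following Python sketch computes the output and input swaths for one processor. The function name `swath_bounds` and the returned dictionary keys are hypothetical; the treatment of End>R-1 (clamping and counting rows of zeros at the bottom) is an assumption that mirrors the Start<0 case, since the text only says it is "similar".

```python
def swath_bounds(R, N, p, lbr, ubr):
    """Return the output swath and the clamped input swath for processor p.

    R: image rows, N: number of processors, p: processor number (0..N-1),
    lbr/ubr: lower and upper row bounds of the Apply window (lbr <= 0 <= ubr).
    """
    S = R // N                               # S = floor(R/N) rows per swath
    out_first, out_last = S * p, S * (p + 1) - 1
    start = S * p + lbr                      # Start = Sp + lbr
    end = S * (p + 1) + ubr - 1              # End = S(p+1) + ubr - 1
    zero_rows_top = max(0, -start)           # rows of 0's inserted before data
    zero_rows_bottom = max(0, end - (R - 1)) # assumed symmetric handling
    return {
        "output": (out_first, out_last),
        "input": (max(start, 0), min(end, R - 1)),
        "zero_rows_top": zero_rows_top,
        "zero_rows_bottom": zero_rows_bottom,
    }
```

For example, with R=240 rows, N=24 processors and a 3x3 window (lbr=-1, ubr=1), processor 0 outputs rows 0 to 9, reads rows 0 to 10, and receives one row of zeros on top.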

Internal to each IP the Apply operation is implemented using an Iliffe
vector method [3]. In the HBA, a scanline of pixels occupies consecutive
memory locations. The Apply implementation uses an Iliffe vector of
ubr-lbr+1 row pointers to hold address offsets into scanlines. To access a
pixel at position [r,c] relative to the coordinate frame of the window, the
Apply routine finds the value in the memory address offset by c from the
row pointer for row r. Each time the Apply program computes an output value
for a window, the column vector advances to the next column of pixels. When
the window reaches the end of a scanline, Apply shifts the row pointers up
and sets the bottom value to the starting address of the next scanline to be
processed.
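
The row-pointer scheme just described can be modelled as follows. This is a minimal sketch under stated assumptions, not the HBA code: memory is a flat Python list, row pointers are integer offsets into it, the name `iliffe_apply` is hypothetical, and only window positions that lie entirely inside the image are computed (border handling is omitted).

```python
def iliffe_apply(memory, rows, cols, lbr, ubr, lbc, ubc, fn):
    """Apply fn to every fully interior window using a vector of row pointers.

    memory holds scanline r at offsets r*cols .. (r+1)*cols - 1.
    """
    height = ubr - lbr + 1
    # One pointer (start offset) per scanline covered by the window.
    row_ptr = [r * cols for r in range(height)]
    next_scanline = height
    out = []
    while True:
        out_row = []
        for col in range(-lbc, cols - ubc):          # advance column by column
            window = [[memory[row_ptr[r - lbr] + col + c]
                       for c in range(lbc, ubc + 1)]
                      for r in range(lbr, ubr + 1)]
            out_row.append(fn(window))
        out.append(out_row)
        if next_scanline >= rows:
            break
        # End of scanline: shift the pointers up and point the bottom
        # entry at the start of the next scanline.
        row_ptr = row_ptr[1:] + [next_scanline * cols]
        next_scanline += 1
    return out
```

Only the small pointer vector changes between scanlines; no pixel data is copied.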

5.  Apply  Code  Compared  with  Hand-written  Code 

Our primary purpose in this paper is to develop a comparison of different
parallel processing machines for vision using Apply as a vehicle. In order to
base this comparison on solid ground, we must first evaluate Apply's
performance compared with hand-written code for the same machine. If
Apply produces code that is comparable to hand-written code, then our
comparison will be solidly based, since the code generated by Apply
represents the peak performance of the machine. On the other hand, if Apply
code is not as good, then the comparison will not be solidly based; it could be
argued that the measured performance would not actually be seen since the
user would not use Apply.

5.1.  Apply  code  compared  with  SPIDER  code 

We  begin  by  comparing  Apply  performance  on  WEB  routines  with  a  set  of 
routines  of  similar  function  from  the  SPIDER  FORTRAN  library.  The 
SPIDER  library  is  professionally  written  and  distributed,  and  the  code  is  of 
high quality; therefore, this comparison pits Apply's code against the code of
expert programmers.

We are comparing the actual execution times (user time plus system time)
of the FORTRAN programs, called as subroutines from C, with execution
times of C programs generated by Apply, called in the same way. The time
is measured from the point at which the input images are ready (have been
stored in the Sun's memory) to the point at which the output images are
ready, in both cases. This time does not include the I/O time for the images
from disk, or the code download time from disk into the Sun. All times are
for 512x512 images.

Figure 6 gives the ratios of execution times for these programs, and Figure
7 shows the distribution of times for all programs. We can see from these
figures the following phenomena:

• The Apply programs are generally faster. There are four factors
that  can  account  for  this:  (1)  Cyclic-scroll  buffering;  (2)  The 
superiority  of  the  Sun  C  compiler  to  the  Sun  FORTRAN 
compiler;  (3)  The  FORTRAN  code  is  written  to  be  readable,  at 
the  expense  of  efficiency;  the  code  generated  by  Apply  need  not 
satisfy  such  a  constraint,  since  the  Apply  input  code  is  quite 
readable.  Apply  can  sacrifice  legibility  for  speed. 

• In some cases, such as addplr and divclt, the Apply code is
slower. In these programs the algorithm is processing a single
pixel from the input image to produce a single pixel in the output
image. The cyclic-scroll buffering technique introduces a
significant overhead in this case. (The same does not apply for
addplb and addclb since here the FORTRAN code is processing
integer images, while the Apply program is processing byte
images. Thus, these programs are not strictly comparable.)


•  Apply  has  some  limitations  in  its  programming  model  that  affect 
performance.  In  the  FORTRAN  subroutines,  it  is  common  to 
write  several  different  ways  of  computing  the  output  depending 
on  switches.  There  is  little  overhead  for  this  in  FORTRAN  since 
the  code  can  be  written  so  as  to  test  the  value  of  the  switches 
once  per  image.  In  Apply,  the  equivalent  code  would  test  the 
value  of  the  switch  once  per  pixel,  since  the  Apply  procedure  is 
executed  in  its  entirety  for  every  pixel.  This  can  limit 
5.2.  Apply  code  compared  with  W2  code 

Next we compare performance on the Carnegie Mellon Warp machine.
This machine is programmed by hand in W2, a Pascal-level language in
which the user is explicitly aware of the different processors and the
communication between them. (Send and receive statements are used to send
words of data between cells.)

While  many  programs  have  been  and  continue  to  be  written  for  Warp  in 
W2,  the  availability  of  Apply  has  significantly  eased  the  programmer's 
burden.  Apply  hides  the  explicit  parallelism  of  cells,  the  number  of  cells  in 
use,  and  the  communications  between  cells  from  the  programmer.  This  has 
made it possible to develop WEB for Warp; without Apply, it is doubtful that
such a library could have been built.

Figures 8 and 9 give the performance for hand-written W2 programs
compared with WEB programs of equivalent function. The times are
measured from the moment the input data is available for processing in the
external host memory by the Warp array to the moment the output data is
stored into the external host memory. All times are for 512x512 images.
There are three main phenomena responsible for the wide distribution of
execution time ratios:

• Some programs, such as egrel, egpw3, and egpw4, are much
slower in W2 than in Apply. This is because the W2
programmer made fewer optimizations to the code (such as
unrolling innermost loops) than the Apply programmer. The
Apply programs are smaller, and do not include statements for
I/O, so that the programmer's effort is focused on making the
heart of the code as efficient as possible.

• The Apply-generated programs consistently overlap I/O with
computation  on  the  cell,  while  the  W2  programs  do  not.  This  is 
because,  while  it  is  possible  to  communicate  between  cells  in  the 
same  Warp  microinstruction  where  computation  is  done,  doing 
so  involves  some  careful  placement  of  I/O  statements  which  can 
be  hard  for  the  programmer. 

•  Some  programs  are  slightly  faster  in  W2  than  in  Apply.  This  is 
because  of  the  limitation  of  the  Apply  programming  model 
discussed  earlier;  the  W2  programmer  can  initialize  state  based 
on  the  values  of  global  variables,  and  avoid  retesting  switches, 
etc.,  during  processing  of  the  images,  while  the  Apply 
programmer  cannot  do  this. 

Here we see two principal effects of the Apply language on Warp
programming: (1) The Apply programs are simpler and easier to write, so
the programmer makes them more efficient, and Apply in turn generates
better code because it can deal with the machine complexity better; (2) The
limitation of the Apply programming model for preprocessing data can lead
to some loss of performance.

6.  Comparison  of  Diverse  Architectures 

It is very rare that widely different computer architectures are compared
directly for performance; the best previous examples have been FORTRAN
studies of supercomputer performance [4]. These have depended on the
implementation of a large language designed for use on sequential
computers, and so have been limited to those computers in which significant
software development has occurred to bring up FORTRAN, and which are
suitable for implementation of a sequential computer language.

The comparisons presented here differ from these because the Apply
language is designed for use on parallel computers, so that a wider range of
computers can be compared, and because Apply is application specific
and does not require an enormous effort to bring up on a new
system. Thus, we are able to directly compare the Sun 3/75, the Carnegie
Mellon Warp machine, and the Hughes HBA.

6.1. Warp Compared with Sun

Figures 10 and 11 give the performance of a large number of programs
implemented both on the Sun 3/75 computer and the Warp machine. This
allows us to evaluate the Warp's performance for image processing
compared with a high-performance workstation.


Warp execution time was measured from the point at which the arrays of
input data were available in Warp's external host to the point at which the
arrays of output data became available. This is consistent with the
measurement method for the Sun 3/75. Code download time was not
included. All times are for 512x512 images.


• In the fmin and fmax algorithms, which compute the minimum
and maximum of a 3x3 window around each pixel, the HBA
time is slightly better than the Warp time. This is because in
such a highly conditional operation, the use of the long pipeline
inside the Warp cell is a barrier to good performance. Moreover,
the total number of ALU operations in the HBA is greater, since
there are 24 processors instead of 10, of comparable integer
performance.


We  observe  the  following  from  these  data: 

• There are a few cases where Warp's performance far exceeds
expectations: egfc and egks2, for example. Based
on the comparative floating point rates, we would expect Warp's
performance to be one to two hundred times that of the Sun, but
here the execution times are 666 and 304 times less than on the Sun.
These large factors are due to the internal parallelism of the
Warp cell; it consists of many independent units, which can be
individually controlled with a wide horizontal microinstruction.
In the best case, a Warp cell can do I/O with other cells, read and
write memory, compute an integer ALU operation, and compute
a floating point add and a floating point multiply, all in the same
200 nanosecond cycle. The success of this design (and of the
compiler in packing instructions together) is shown in the ratios
for egfc and egks2.

• In the majority of cases, Warp is tens of times faster than
the Sun 3/75. (The average ratio is 67, with a median of 40.)
This reflects the raw processing power of Warp combined with
the effects of the applications mix (which includes a large
amount of integer processing) and the efficiency of the Warp
compiler.

•  In  some  cases,  the  ratio  is  ten  or  less.  In  these  cases,  the  Apply 
program  cannot  make  use  of  Warp's  highly  pipelined  floating 
point  units,  because  of  a  large  amount  of  conditional  branching 
within the program, and also because the computation is mainly
additions,  so  that  the  separate  multiplier  cannot  be  used.  Here 
we  are  seeing  the  effects  of  using  a  highly  pipelined  machine  to 
implement  what  is  essentially  a  scalar  operation.  The  multiple 
independent  Warp  cells  can  still  be  used  effectively,  since  the 
computation  is  independent  between  each  cell;  but  the  pipelining 
of  the  computation  within  the  cell  is  not  successful. 


7.  Conclusions 

This is the first study of which we are aware in which highly diverse
architectures have been compared using the same source code. We can draw
several conclusions from this study:

•  Apply  and  WEB  are  clearly  good  tools  for  comparing  these 
architectures.  Quite  apart  from  the  utility  of  having  Apply  and 
WEB  available  on  an  architecture,  which  is  considerable,  using 
the  same  source  code  eliminates  many  factors  that  significantly 
affect  performance  but  which  are  irrelevant  to  performance 
analysis.  Apply  is  easy  to  implement  on  a  parallel  processor, 
which  makes  it  possible  to  evaluate  the  performance  of  a  large 
number of parallel machines with little effort. We look forward
to evaluating other parallel architectures in the future.

•  The  main  performance  limitation  of  the  Apply  programming 
model  is  the  inability  to  manipulate  constant-type  parameters 
once  per  image  rather  than  once  per  pixel.  This  deficiency  will 
have  to  be  corrected  in  future  versions  of  Apply  or  its 
successors. 


•  Most  of  the  performance  limitations  of  one  architecture  over 
another  arise  from  intra-processor  characteristics,  rather  than 
inter-processor  characteristics.  This  is  because  this  level  of 
vision  is  easy  to  parallelize,  so  that  different  processors  need  to 
communicate  very  little.  Much  more  significant  is  the  ability  of 
the  processor  to  successfully  implement  the  wide  range  of 
operations  that  is  required  in  low-level  vision,  including  integer, 
floating  point,  and  conditional  operations. 


•  Parallel  processors  deliver  performance  increases  even  over 
high-performance  workstations  at  this  level  of  vision,  and  Apply 
makes  them  no  harder  to  program.  The  performance  ratios  vary 
from ten- to hundred-fold increases. This is a significant,
cost-effective performance increase.

processing in the frame buffer of the HBA to the time the output image is
stored  there.  At  the  time  of  this  study,  the  HBA  processed  240x256  images; 
to  be  consistent  with  the  Warp  times,  the  HBA  times  have  been  multiplied  by 
4.27.  The  Warp  times  are  for  512x512  images,  measured  as  before. 
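
The 4.27 factor is consistent with scaling by pixel count between the two image sizes. The short Python check below makes that arithmetic explicit; the helper name `pixel_scale_factor` is hypothetical, and the assumption that the factor was derived purely from the ratio of pixel counts is an inference from the numbers, not something the text states.

```python
def pixel_scale_factor(reference_shape, measured_shape):
    """Ratio of pixel counts between a reference and a measured image size."""
    ref_rows, ref_cols = reference_shape
    rows, cols = measured_shape
    return (ref_rows * ref_cols) / (rows * cols)

# 512x512 reference images versus the 240x256 images the HBA processed.
factor = pixel_scale_factor((512, 512), (240, 256))
print(round(factor, 2))  # 262144 / 61440, which rounds to 4.27
```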

Figure 8: Ratio of execution times of hand-generated W2 code
to Apply code.

Vertical  line  indicates  a  ratio  of  one. 

Figure 6: Ratio of execution times of hand-generated SPIDER
FORTRAN to Apply code.

Vertical line indicates a ratio of one.

Figure 9: Scatter diagram of execution times of hand-generated W2 code
and  Apply  code. 

Diagonal  line  indicates  equality. 

Figure 7: Scatter diagram of execution times of hand-generated SPIDER
FORTRAN and Apply code.

Diagonal line indicates equality.

Abstract

In this paper, we summarize the activities at USC in
developing parallel architectures and parallel techniques
for various image and vision problems. The proposed
architectures include the enhanced mesh, the reconfigurable
mesh and the mesh of meshes. These architectures are
illustrated by parallelizing known techniques and/or
developing new parallel algorithms for image processing and
vision. We develop efficient algorithms to support data
routing among the processors. By using these routing
techniques, we solve a variety of problems related to digitized
images, as well as problems in image understanding using
linear segments as primitives.


1  Introduction 

Image processing and vision tasks are usually
computationally intensive. A variety of techniques have been
developed for efficient solutions to such problems, from low level
processing of digitized images, to high level processing, as
in image understanding. Many parallel architectures have
been proposed to efficiently implement these techniques
([12], [13], [7], [14], [15], [57], [64], [32], [33], [34], [49], [47],
[4], [62]).

In this paper, we summarize the activities at USC in
developing parallel architectures and parallel techniques
for various image and vision problems. The proposed
architectures include the enhanced mesh, the reconfigurable
mesh and the mesh of meshes. These architectures are
illustrated by parallelizing known techniques and/or
developing new parallel algorithms for image processing and
vision. We develop efficient algorithms to support data
routing among the processors. By using these routing
techniques, we solve a variety of problems related to digitized
images, as well as problems in image understanding using
linear segments as primitives.

In the next section, we address architectures and
algorithms for information extraction from digitized images.
The 2-dimensional mesh connected computer is a well
studied parallel architecture for solving such problems [51,32].
The mesh organization, from a hardware point of view, is
superior to other models with respect to modularity,
regularity and its low interconnection cost. A prohibiting factor
in achieving large speedups using the mesh is its diameter,
which is Ω(N^{1/2}) for an N^{1/2} x N^{1/2} mesh. The proposed
models include modifications which alleviate the Ω(N^{1/2})
running time performance of the mesh. For example, the
addition of broadcast buses to the mesh results in the
enhanced mesh having time performance comparable to that
of a pyramid for many image problems. We illustrate
elegant parallel solutions on the mesh of trees, an
organization not well studied by researchers in image processing
and vision. We also show a related architecture, the mesh
of meshes. On this model we can achieve linear speedup
in solving several image problems, while the number of
processors can vary over a wide range. A more powerful
architecture, the reconfigurable mesh, seems to be a
universal array for image computations, providing the ability
to simulate both the pyramid and the mesh of trees
organization. Moreover, on the reconfigurable mesh, superior
asymptotic time performance can be obtained, compared
to the best known performance on the pyramid and the
mesh of trees organizations.

Over the past decade, several techniques have been
developed at USC for problems in image understanding.
Among these, using linear segments as primitives [29,30]
has been well studied. In section III of the paper we
consider parallelizing such techniques. We develop an input
partitioning strategy for the parallel solution of the image
matching and the stereo matching problems using linear
segments as input. This leads to a parallel implementation
of such algorithms achieving linear speedup on a mesh
connected computer. We also present a systolic
implementation of these algorithms. This achieves a speedup of
0{k'!')  on  a  systolic  array  with  k  +  k */2  cells,  4  <  k  <  m, 
where  m  is  the  number  of  linear  segments  in  the  image. 

Details  of  the  work  presented  here  can  be  found  in 
[4,44,47,48,49].  We  thank  Hussein  Alnuweiri  and  Mary 
Eshaghian,  for  sharing  their  contributions  in  this  paper. 


2 Parallel Architectures and Algorithms for Image Computations

In this section we present parallel architectures and
algorithms for solutions to problems related to digitized
images. The input to these algorithms is a digitized image
of size N^{1/2} x N^{1/2}. Connected 1's through neighbor to
neighbor connections are called figures [50].

In all our analysis we make the following assumptions:
each processor can perform arithmetic and logic
operations in O(1) time. Each processor has a constant number
of registers serving as local memory. A register can hold
O(log N)-bit data. A packet (record), which consists of
O(log N) bits, can be transmitted using a connection link
between two processors in O(1) time.

2.1  Enhanced  Mesh 

The enhanced mesh of size N^{1/2} x N^{1/2} is based on a
2-dimensional mesh connected computer (2-MCC) of size
N^{1/2} x N^{1/2}. Each processor is placed at a grid point (i,j),
0 ≤ i,j ≤ N^{1/2} - 1. In addition to the standard 2-MCC
organization we allow broadcast in each row and each
column. In other words, each processor in a row (column)
has a port connected to the corresponding row (column)
broadcast bus. During a broadcast cycle, among the N^{1/2}
processors sharing a bus, at most one processor can write
onto the bus. All the processors sharing a bus can read
from the bus. An enhanced mesh of size 4 x 4 is shown in
figure 1. Using the broadcast bus a processor can transmit
a record to another processor in the same row (column)
in O(1) time. Notice that such a scheme reduces the time
needed for communication between any two processors to
O(1). A clever use of the broadcast buses is of significant
assistance in developing fast operations among data [43].
A set of such operations which makes the enhanced mesh
an attractive scheme for implementing image
understanding algorithms is:

1. Any associative binary operation among N data items
can be performed in O(N^{1/6}) time.


Figure 1: Enhanced mesh (legend: mesh connection, bus connection)


2.  AU/J  data  stored  one  item  per  processor  in  a  row  (col¬ 
umn)  can  be  sorted  in  0(log2JV)  time  and  an  associa¬ 
tive  binary  operation  on  these  data  can  be  performed 
in  0(log  N)  time. 

Data routing can be accomplished by the following
operations developed for SIMD machines [37]: the Random
Access Read (RAR) and Random Access Write (RAW).
These operations can be informally described as follows:
each processor PE_{i,j}, 0 ≤ i,j ≤ N^{1/2} - 1, has a data item
D(i,j) and an address A_{i,j}. The address corresponds to
the index of a processor in the array. In RAR, PE_{i,j} is
to read the data D(A_{i,j}). In RAW, each PE_{i,j} sends its
data to the processor pointed to by its address register
(details can be found in [37]). These operations take O(N^{1/2})
time on the 2-MCC as well as on the enhanced mesh. In
cases where the distribution of source and destination
processors is sparse, the broadcast buses can support
significantly faster data movement. Such sparse data movement
is often needed in parallel solutions to image problems.
Sparsity conditions that allow fast data movement using
the buses are:

1. In any block of size k x k, 1 ≤ k ≤ N^{1/2} - 1, the
number of source and destination processors is ≤ k.
Then, random access read and write can be
performed in O(N^{1/6}) time on the enhanced mesh [47].

2. In any block of size k x k, 1 ≤ k ≤ N^{1/4}, the number
of source and destination processors is ≤ k. In this
case, random access read and write can be performed
in O(N^{1/4}) time [47].
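
The semantics of RAR and RAW described above can be stated sequentially in a few lines of Python. This sketch only captures what each processor ends up holding; it says nothing about how the operations are routed on the mesh or the buses. The function names and the tuple form of the addresses are assumptions, as is the tie-breaking rule in RAW (when several processors write to the same destination, the last one in row-major order wins).

```python
def random_access_read(data, addr):
    """RAR: processor (i, j) obtains the data held at address addr[i][j]."""
    n = len(data)
    return [[data[addr[i][j][0]][addr[i][j][1]] for j in range(n)]
            for i in range(n)]


def random_access_write(data, addr):
    """RAW: processor (i, j) sends its data to the processor at addr[i][j].

    Destinations that receive nothing hold None. Conflicting writes are
    resolved arbitrarily here (last writer in row-major order wins).
    """
    n = len(data)
    received = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ti, tj = addr[i][j]
            received[ti][tj] = data[i][j]
    return received
```

For a 2 x 2 array with data [[1, 2], [3, 4]] and addresses [[(1, 1), (0, 0)], [(0, 1), (1, 0)]], RAR yields [[4, 1], [2, 3]] and RAW yields [[2, 3], [4, 1]].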

Algorithms for processing digitized images on the
enhanced mesh are asymptotically much faster compared to
those on the mesh. In many instances the performance
of the algorithms on the enhanced mesh is comparable to
that on the pyramid computer of the same size [47]. We
outline an algorithm for labeling all the figures in an image,
as an illustration of the use of broadcast buses.

The idea is to divide the entire mesh into blocks of
size k x k and label the parts of the figures which lie
inside each block, for all the blocks in parallel. Blocks are
merged in parallel to yield new labels in larger size blocks,
by using the connectivity information along the boundary
of adjacent blocks. Figures in adjacent blocks are merged
as follows:

Assume that the MCC is divided into blocks of size k x k
and all the figures inside each block are labeled. To relabel
the figures of two neighboring blocks we use the following
graph representation. Each label corresponds to a vertex. Let p
and q be nodes of the graph with corresponding labels l_p
and l_q. Suppose we are merging two blocks horizontally.
Then, an edge between the nodes p and q exists in the
graph if and only if there exists a pair PE_{i,j} and PE_{i,j+1} at
the boundary between the two blocks, such that PE_{i,j} has label
l_p and PE_{i,j+1} has label l_q. To compute the connectivity of
the figures belonging to two adjacent blocks, we examine
the connected components of the graph defined above. The
input to the algorithm during each iteration is a set of m
(m ≤ k) edges for each pair of adjacent blocks of size k x k.
The input data set is stored in a linear array formed by
the border processors of the two blocks to be merged. The
size of the array is k. Blocks are merged in three phases
as follows [47]:

In the first phase, we merge blocks using the mesh
connections, until the block size is N^{1/4} x N^{1/4}. This takes
O(N^{1/4}) time.

We  implement  the  second  phase  as  follows:  In  each 
block  of  size  l  x  l,  we  move  the  l  edges  of  the  graph  in  a 
block  of  size  %/7  x  y/I.  Four  adjacent  blocks  of  size  /  x  l  can 
be  merged  in  0(%/7logl)  time.  During  the  ith  iteration  of 
the  second  phase,  merging  four  blocks  of  size  l  x  l,  where 
N't*  <  l  <  (IVl/2/log2IV),  takes  0(2‘Nl/e  log  IV)  time  for 
the  connected  components  algorithm  and  0(N'l*  / 2')  time 
for  communication  on  the  mesh  by  using  the  broadcast 
buses,  0  <  »  <  log(7VI/8/log2IV).  The  entire  second  phase 
takes  0(1V1/4)  time. 

The  last  phase  of  the  algorithm  starts  when  the  size 
of  the  block  is  (IV1/4/log4IV)  x  (IV1/4/log4IV).  Now,  there 
are  N't2 /log8 N  buses  available  to  each  block.  There  are 
2N't2 /log4 TV  edges  (in  each  block)  to  be  used  for  the  con¬ 
nected  components  algorithm.  Using  a  well  known  tech¬ 
nique  [21|  for  computing  connected  components,  the  last 
merging  phase  can  be  completed  in  0(log9lV  log  log  AT)  time 
using  the  broadcast  buses.  This  leads  to: 

Theorem 2.1 On an enhanced mesh of size N^{1/2} x N^{1/2},
given a N^{1/2} x N^{1/2} digitized image, all figures can be
labeled in O(N^{1/4}) time.

Notice that the labeling problem takes O(N^{1/4}) time
on a pyramid computer of base N^{1/2} x N^{1/2} [31].
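
The merging step described above (one vertex per label, one edge per pair of touching boundary pixels) can be sketched sequentially. This is only an illustration under stated assumptions: the function name `merge_labels`, the use of 0 for background pixels, and the union-find structure are ours and are not taken from [47], which computes the components in parallel on the bus-enhanced mesh.

```python
def merge_labels(left_border, right_border):
    """Relabel figures of two horizontally adjacent blocks.

    left_border[i] / right_border[i] hold the labels of the two boundary
    pixels in row i (0 means background; an assumption of this sketch).
    Returns a dict mapping every label seen on the boundary to the
    smallest label of its connected component.
    """
    parent = {}

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller label as the representative.
            parent[max(ra, rb)] = min(ra, rb)

    # Each label is a vertex of the graph.
    for lab in list(left_border) + list(right_border):
        if lab != 0:
            parent.setdefault(lab, lab)

    # An edge exists when two figure pixels face each other across the boundary.
    for lp, lq in zip(left_border, right_border):
        if lp != 0 and lq != 0:
            union(lp, lq)

    return {lab: find(lab) for lab in parent}
```

For example, `merge_labels([1, 1, 0, 2, 0], [3, 0, 0, 4, 5])` joins labels 1 and 3, joins 2 and 4, and leaves 5 on its own.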

Using  the  buses  to  perform  sparse  data  movement,  a 
closest  figure  to  each  figure  can  be  computed  [47] : 

Theorem 2.2 On an enhanced mesh of size N^{1/2} x N^{1/2},
given a N^{1/2} x N^{1/2} digitized image, a nearest figure to each
figure can be computed in O(N^{1/4}) time.

The above computation takes O(N^{1/4}) time on a pyramid
of base N^{1/2} x N^{1/2} [31]. Buses can be used to efficiently
compute geometric properties of images. Using the
sparse data movement techniques, it is possible to show
[47]:

Theorem 2.3 On an enhanced mesh of size N^{1/2} x N^{1/2},
the geometric properties of the convex hull of a single figure
(such as its diameter, a smallest enclosing box and
a smallest enclosing circle) can be computed in O(N^{1/6})
time.

As  a  comparison,  notice  that  the  computation  of  the 
above  geometric  properties  takes  0(N'te)  time  on  a  pyra¬ 
mid  of  base  N't2  x  N'/2  [31].  Further  on  image  computa¬ 
tions  on  the  enhanced  mesh  can  be  found  in  [47]. 

2.2  Mesh  of  Trees  Organization 

In the Mesh of Trees (MOT) organization, a processor is
placed at each grid point of an N^{1/2} x N^{1/2} mesh. Each row and
each column of processors forms the leaves of a complete
binary tree [27,38]. The root and the internal nodes of each
binary tree are also processors. The N leaf processors form
the base of the network and are called base processors.
A 4 x 4 MOT is shown in figure 2. In most cases the
computations on the mesh of trees are done by the base
processors; the upper level processors are used mainly for
communication.
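
As a small illustration of the structure just described, the following sketch counts the processors of an MOT with a given base side. The function name and the restriction to sides that are powers of two (so that every row and column tree is a complete binary tree) are assumptions of this sketch, not statements from the paper.

```python
def mot_processor_counts(side):
    """Count processors in a mesh of trees with a side x side base.

    Assumes `side` is a power of two, so each row and each column of
    base processors forms the leaves of a complete binary tree.
    """
    if side < 1 or side & (side - 1):
        raise ValueError("side must be a power of two")
    base = side * side                 # the N leaf (base) processors
    trees = 2 * side                   # one tree per row and one per column
    internal_per_tree = side - 1       # non-leaf nodes of a tree with `side` leaves
    internal = trees * internal_per_tree
    height = side.bit_length() - 1     # log2(side): leaf-to-root hops
    return {"base": base, "internal": internal,
            "total": base + internal, "tree_height": height}
```

For the 4 x 4 case this gives 16 base processors and 24 tree processors. A value travelling between two leaves of the same row tree crosses at most `2 * tree_height` links, which is why sparse row or column data movement is logarithmic in the side length.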

The  MOT  organization  has  been  well  studied  in  the 
context  of  VLSI  computations  [63].  However,  not  much 
work  has  been  done  in  using  this  organization  for  image 
computations. 

The MOT can efficiently implement divide and conquer
techniques for image problems. As an illustration, we
present an algorithm for computing the convex hull of all
the figures in an image [44]. The time performance of the
algorithm can be further improved; in this paper, however,
we illustrate only the basic technique.

Figure 2: Mesh of Trees Organization

Assume that the entire image has been divided into disjoint
blocks of size k x k. By merging four adjacent blocks
of size k x k we obtain the convex hull of each figure inside
blocks of size 2k x 2k. This is done for all the figures in
parallel. Convex hulls located in two adjacent blocks are
merged if they belong to the same figure. Merging can be
accomplished by detecting the two common tangent lines
to the convex polygons [41]. This is done by performing
a binary search on the extreme points of each convex
hull. The merging operation of four adjacent blocks takes
O(log^3 k) time, leading to:

Theorem 2.4 On a N^{1/2} x N^{1/2} MOT, the convex hull of
all the figures can be computed in O(log^4 N) time.

Several  geometric  properties  of  a  single  figure  can  be 
easily  computed  on  the  MOT.  For  example: 

Theorem 2.5 The convex hull of a single figure, its diameter,
a smallest enclosing box and a smallest enclosing
circle of the figure can be computed in O(log N) time on a
N^{1/2} x N^{1/2} MOT.

The Mesh of Trees is an efficient parallel organization for
problems with sparse information exchange. For example,
N^{1/2} data items can be moved from one column to another
in O(log N) time, while such an operation takes O(N^{1/4})
time (in the worst case) on a pyramid of base N^{1/2} x N^{1/2}
[31]. Such a feature is very attractive for solving several
image problems [44].

2.3  Mesh  of  Meshes 

The Mesh of Meshes (MOM) organization has a reduced
number of processors compared to other parallel architectures
studied for problems on digitized images. In addition,
a novel interprocessor communication scheme is
supported on the MOM: processors communicate through
shared memory locations as well as direct interprocessor
links. Using this feature, optimal speedup for a large
number of problems in image processing can be obtained
[2,3].

The MOM organization consists of p processors and
n^2 memory locations. The MOM is organized as an array
of k x k, 1 ≤ k ≤ n, basic modules (BMs). Each BM
is a parallel organization consisting of q processors, where
1 ≤ q ≤ n/k, and a q x q array of memory modules. The
memory size of each BM is n^2/k^2. Within a BM, each
processor has access to one row and one column of the memory
modules in the BM. Also, each processor is connected to
the four corresponding processors in the four neighboring
BMs. Note that the total number of processors in the
MOM is p = k^2 q, and the total number of memory modules
is k^2 q^2 (i.e. a total of n^2 memory locations). Such
an organization is shown in figure 3 for q = 4. The Mesh
of Meshes provides optimal speedup for several computations
on digitized images when p is in the range 1 to n^{3/2}.
When p = n the performance of the MOM is comparable to
the 2-MCC of size n x n for n x n image problems. Notice
that O(n) is the lower bound on time performance for any
nontrivial problem on the 2-MCC [36]. Most of the image
problems can be solved in O(n) time on the MOM when
p = n.
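
The relations among n, k, q and p given above can be checked with a few lines of arithmetic. In this sketch the function name and the divisibility requirements (k divides n, and q divides n/k, so that all sizes are whole numbers) are our assumptions; the paper states only the counts p = k^2 q and k^2 q^2.

```python
def mom_parameters(n, k, q):
    """Derive the size parameters of a mesh of meshes.

    n: image side (n*n memory locations in total)
    k: the MOM is a k x k array of basic modules (BMs)
    q: processors per BM; each BM has a q x q array of memory modules
    """
    if not (1 <= k <= n) or n % k:
        raise ValueError("this sketch assumes k divides n and 1 <= k <= n")
    if not (1 <= q <= n // k) or (n // k) % q:
        raise ValueError("this sketch assumes q divides n/k and 1 <= q <= n/k")
    processors = k * k * q                  # p = k^2 q
    modules = k * k * q * q                 # k^2 q^2 memory modules
    bm_memory = (n * n) // (k * k)          # n^2 / k^2 locations per BM
    module_memory = bm_memory // (q * q)    # locations per memory module
    return {"p": processors, "modules": modules,
            "bm_memory": bm_memory, "module_memory": module_memory}
```

With n = 16 and k = q = 4 this yields p = 64 = n^{3/2}, the upper end of the range quoted above, while n = 16, k = 2, q = 4 yields p = 16 = n.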

An interesting result is that on the MOM, n^2 elements
can be sorted in O(n^2 log n / p) time, for 1 ≤ p ≤ n log n [3].

Scanning through values stored in a memory row or in
a memory column can be performed in O(n/k + k) time.
Suppose an n x n digitized image is stored on the mesh of
meshes, one pixel per memory location. The convex hull
of a set of 1's can be computed in two steps. First, the
leftmost and the rightmost 1's on each row are identified.
Second, for each of the points detected above we compute
the angles with all the other points, to decide if it is an
extreme point. Both steps can be performed by scan
operations.
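
The two steps above can be illustrated sequentially. In this sketch the first step is as described (leftmost and rightmost 1 per row); for the second step we substitute Andrew's monotone chain for the angle test, which is a different technique from the scan-based procedure of the paper but yields the same extreme points. Function names and the (column, row) point convention are assumptions of the sketch.

```python
def row_extremes(image):
    """Step 1: leftmost and rightmost 1 of every row, as (col, row) points."""
    points = []
    for r, row in enumerate(image):
        cols = [c for c, v in enumerate(row) if v == 1]
        if cols:
            points.append((cols[0], r))
            if cols[-1] != cols[0]:
                points.append((cols[-1], r))
    return points


def cross(o, a, b):
    """Signed area of the turn o -> a -> b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Step 2 (substitute): extreme points via Andrew's monotone chain."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]
```

Restricting attention to the row extremes is safe because a hull vertex can never lie strictly between two other 1's of its own row, so the hull of the candidates equals the hull of all 1 pixels.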

Using  scan  of  data  and  efficient  data  reduction  and  data 
movement,  we  can  show  [2]: 

Theorem 2.6 The mesh of meshes organization with p
processors can perform the following computations on an
n x n image in O(n^2/p) time, 1 ≤ p ≤ n^{3/2}:

1. Labeling all the figures in the image.

2. Constructing the convex hull of a set of pixels.

3. Computing the diameter and a smallest enclosing box
(circle) of a set of pixels.

4. Computing the closest figure to each figure.

When k = n^{1/2}, the number of processors in the mesh
of meshes is n^{3/2} and several image problems can be solved
in O(n^{1/2}) time. The performance of the mesh of meshes
can be compared to the mesh with reduced diameter [33].
The mesh with reduced diameter has s x s processors with
each processor having n^2/s^2 local memory. The mesh with
reduced diameter provides linear speedup for several image
computations, for s in the range 1 to n^{2/3}. All the problems
mentioned in the above theorem can be solved in O(n^2/s^2 +
s) time. Thus, the fastest possible solution on the mesh
with reduced diameter takes O(n^{2/3}) time (at s = n^{2/3}),
while it takes O(n^{1/2}) time on the mesh of meshes.

Figure 3: Mesh of Meshes

Figure 4: Reconfigurable Mesh

Related results and details can be found in [2,3,4].


2.4  Mesh  with  Reconfigurable  Bus 


A mesh with reconfigurable bus (reconfigurable mesh) of
size N consists of a 2-dimensional mesh connected array
of processors (mesh) of size N, with each processor
connected to a broadcast bus. This bus, like the mesh, is also
constructed as an N^{1/2} x N^{1/2} grid, where the processors
are connected to the bus at the intersections of the grid.
Further, each bus link between intersections has a switch
embedded in it. The two processors at each end of the link
can control the switch. These switches allow the broadcast
bus to be divided into subbuses, where each subbus can
function as a smaller reconfigurable mesh. Other than the
buses and switches, the reconfigurable mesh is similar to the
standard mesh in that it operates in SIMD (single instruction
stream, multiple data stream) mode and has O(N)
area, under the assumption that processors, switches, and
single links have constant size.

In each subbus shared by multiple processors, we assume
that at any given time at most one processor may use
the bus to broadcast a value, where a value consists of
O(log N) bits. Two computational models are distinguished
by their assumption about the delay that a
broadcast requires. The unit-time delay model assumes
that all broadcasts take O(1) time, as is the assumption
in [8,43,47,55,56] for models that assume various
broadcasting strategies. The log-time delay
model assumes that each broadcast takes
O(log s) time to reach all processors connected to its
subbus, where s is the maximum number of switches in a
minimum switch path between two processors connected on the
bus. In this paper, for the sake of simplicity, we illustrate the
unit-time delay model. Related results can be found in
the joint work with Russ Miller and Quentin Stout, who
independently investigated a similar organization [35].

Major  advantages  of  the  reconfigurable  mesh  are  as  fol¬ 
lows: 


1.  Buses  can  be  used  to  speed  up  parallel  arithmetic  and 
logic  operations  among  data  stored  in  different  pro¬ 
cessors.  The  reconfiguration  scheme  supports  several 
techniques  on  the  CRCW  PRAM  model,  leading  to 
the  same  time  performance  as  in  the  PRAM  model 
having  N  processors. 


2.  The  reconfigurable  mesh  provides  an  environment  for 
efficient  sparse  data  movement  operations. 


3.  A  significant  asymptotic  improvement  can  be  achieved
in  the  running  time  of  algorithms  that  solve  several 
problems  on  the  reconfigurable  mesh  compared  to  ef¬ 
ficient  algorithms  for  the  mesh-of-trees,  pyramid  and 
mesh  with  static  broadcast  buses. 


4.  The  reconfigurable  mesh  can  act  as  a  universal  chip 
in  that  VLSI  organizations  with  equivalent  area  can 
be  simulated  without  loss  in  time. 


Well known organizations such as the Mesh-of-Trees (MOT)
and the pyramid computer can be efficiently simulated by
the reconfigurable mesh due to the numerous communication
patterns that the reconfigurable mesh provides.

A  step  by  step  simulation  [49]  of  hierarchical  organi¬ 
zations  (pyramid,  MOT)  can  be  done  as  follows:  Define  a 
c-embedding  of  a  hierarchical  organization  onto  the  recon¬ 
figurable  mesh  to  have  the  following  properties. 


1.  A  constant  number  of  vertices  of  the  hierarchical  or¬ 
ganization  are  mapped  to  each  vertex  of  the  recon¬ 
figurable  mesh. 


2.  The number of edges between levels l and l + 1, 0 <
l < k - 1, incident on any row or column bus segment
is at most c.


Define  a  class  of  algorithms  to  be  normalized  algorithms  if 
the  following  hold  [63]. 


1.  During  a  computation  step  of  the  algorithm,  all  data 
operated  on  are  located  at  the  same  level  of  the  hi¬ 
erarchical  organization. 


2.  All  communication  steps  are  performed  between  at 
most  two  adjacent  levels. 


This  leads  to  the  following  results: 


Proposition  2.1 


1.  Any  normalized  algorithm  running  in  T(N)  time  on 
a  mesh  of  trees  of  base  size  N  can  be  simulated  on  a 
reconfigurable  mesh  of  size  N  to  finish  in  0(T(N)) 
time. 

2.  Any  algorithm  running  in  T(N)  time  on  a  mesh-of- 
trees  of  base  size  N,  can  be  simulated  on  a  reconfig¬ 
urable  mesh  of  size  N  to  finish  in  0(T(N)  log,  N) 
time.  Further  this  time  is  optimal. 

3.  Any algorithm running in time T(N) on a pyramid
of size N can be simulated on a reconfigurable mesh
of size N in O(T(N)) time.


We  now  turn  our  attention  to  the  simulation  of  the 
reconfigurable  mesh  by  the  pyramid. 


Proposition 2.2 Any algorithm running on the reconfigurable
mesh of size N in time T(N) under the unit-time
delay model can be simulated on a pyramid of size N in
O(T(N) N^{1/4}) time. This simulation is optimal.


Details  of  the  above  simulations  and  the  proof  of  opti¬ 
mality  can  be  found  in  [49,46]. 

The  reconfigurable  mesh  can  provide  efficient  commu¬ 
nication  between  processors  when  the  amount  of  data  is 
sparse.  Such  fast  communication  of  data  is  useful  in  effi¬ 
cient  parallel  solutions  to  many  problems  involving  graphs 
and  digitized  images. 


Image  Problems 


Many problems involving digitized images can be solved
efficiently on the reconfigurable mesh. The input for these
problems is an N^{1/2} x N^{1/2} digitized image distributed one
pixel per processor on a reconfigurable mesh of size N so
that processor PE_{i,j} has pixel (i,j). The problems that we
examine focus on labeling figures and determining properties
of the figures.


Theorem 2.7 Given an N^{1/2} x N^{1/2} digitized image mapped
one pixel per processor onto the processors of a reconfigurable
mesh of size N in a natural fashion, the figures can
be labeled in O(log^2 N) time.

The above theorem is proved by using the divide and
conquer approach discussed in section 2.1 and the
fact that the connected components of a graph represented
by its N^{1/2} x N^{1/2} adjacency matrix can be computed in
O(log N) time on the reconfigurable mesh.

A  closest  figure  to  each  figure  can  be  computed  effi¬ 
ciently  on  the  reconfigurable  mesh  using  the  following  di¬ 
vide  and  conquer  technique  and  the  reconfigurable  buses. 

Assuming that the figures are already labeled, the
algorithm operates in two phases. In the first phase each
processor belonging to a figure locates the closest processor
to itself having a different label as follows. Using the
row and the column buses, each border processor locates
its closest 1's on its row and column using a bus splitting
mechanism [46]. By this, the segments of the row and
column buses between 1's are isolated. The labels and the
indices of the border processors defining each segment are
broadcast on the segment. Now, each processor having a
0 and located in between 1's (two 1's in its column and
two 1's in its row) computes the closest 1 having a
different label. By performing a prefix computation within each
segment, in parallel for all the segments, each border
processor learns which is its closest processor having a 1 with
a different label.
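
The bus splitting step for one row can be mimicked sequentially: every pixel learns the nearest 1 on either side of it, which is exactly the pair of border processors that would delimit its isolated bus segment. This is a hypothetical sketch; the function name, the 0-for-background convention and the two linear sweeps stand in for the constant-time bus operations of [46].

```python
def split_row_bus(labels):
    """Simulate bus splitting along one row of labeled pixels.

    labels[c] is the figure label of pixel c, or 0 for background
    (an assumption of this sketch). For every position the result holds
    (left, right): the (column, label) of the nearest figure pixel
    strictly to the left and strictly to the right, or None if absent.
    """
    n = len(labels)
    left, right = [None] * n, [None] * n

    last = None
    for c in range(n):                  # sweep left to right
        left[c] = last
        if labels[c] != 0:
            last = (c, labels[c])

    last = None
    for c in range(n - 1, -1, -1):      # sweep right to left
        right[c] = last
        if labels[c] != 0:
            last = (c, labels[c])

    return list(zip(left, right))
```

A background pixel whose two delimiting labels differ lies between two distinct figures, and the column indices returned give the distance between them along the row.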

During the second phase, blocks are merged in a bottom
up approach to identify a unique nearest figure to
each figure. We start with the assumption that all the
processors with the same label located on the periphery of
a block of size k x k have the same information regarding
the closest 1 to the label. On the reconfigurable mesh the
minimum or maximum of N^{1/2} values stored in a column
can be computed in O(1) time [46]. Using this result, the
closest 1 to each figure can be computed within the block
of size 2k x 2k in O(1) time. Using the updated information,
the processors on the periphery of the 2k x 2k block
can start the new iteration. Each iteration of the second
phase takes O(1) broadcasts and bus splitting operations,
leading to:

Theorem 2.8 Given an N^{1/2} x N^{1/2} digitized image mapped
one pixel per processor onto the processors of a reconfigurable
mesh of size N in a natural fashion, in O(log N)
time a closest figure to each figure can be determined.

In  computing  the  convex  hull  and  the  geometric  prop¬ 
erties  of  a  single  figure,  efficient  use  of  the  reconfigurable 
bus  leads  to: 

Theorem 2.9 Using a N^{1/2} x N^{1/2} reconfigurable mesh,
the convex hull of a single figure can be computed in O(log N)
time while its diameter and a smallest enclosing box can
be determined in O(1) time.

Proofs of the above theorems, as well as related results,
can be found in [46,49].

3  Parallel  processing  of  segment 
based  input  problems 

During  the  past  few  years,  researchers  at  USC  have  de¬ 
veloped  algorithms  for  image  understanding  using  linear 
segments  as  primitives  [29,30].  In  this  section  we  turn 
our  attention  to  the  parallelization  of  these  algorithms. 
Here,  we  illustrate  the  parallelization  of  such  techniques 
by  studying  the  image  matching  problem.  More  details 
as  well  as  parallel  implementation  of  the  stereo  matching 
can  be  found  in  [48].  First  we  outline  a  well  known  tech¬ 
nique  for  image  matching.  Then,  we  present  a  strategy  to 
partition  the  input  data  set.  This  partitioning  technique 
leads  to  achieving  a  linear  speedup  on  a  mesh  connected 
computer.  We  also  show  a  systolic  implementation  of  the 
algorithm. 

Image  Matching  using  Linear  Segments 

The  problem  is  to  match  two  images.  The  primitives  used 
by the algorithm are linear segments [29]. These are described
by their orientation, the coordinates of their endpoints,
and average contrast. The matching procedure is based
on  the  discrete  relaxation  technique  [29].  Let  N  and  M 
denote  the  set  of  vectors  in  the  two  images.  Let  |  M  |=  m, 

|  JV  |=  n  and  n  m.  A  primitive  in  the  first  image  is 
denoted as object a_i, 1 ≤ i ≤ n. A primitive in the second
image is denoted as label l_j, 1 ≤ j ≤ m. The technique
computes the quantity p(i,j), in [0,1], which is the possibility
that object a_i corresponds to label l_j. The assignment
of object a_i to label l_j relies on geometrical constraints.


This means that when we assign a label l_j to an object
a_i, we expect to find an object a_h, with possible assigned
label l_k, in an area depending on i, j, k. This area is called
the window w(i,j,k). The window is defined as follows:
Represent the object a_i by the vector A_iB_i, the label l_j by
the vector P_jQ_j and the label l_k by the vector P_kQ_k. Then
w(i,j,k) is the parallelogram R_1R_2S_1S_2 determined by the
vector equations:

1. A_iR_1 = P_jP_k

2. R_1S_1 = P_kQ_k

3. B_iR_2 = Q_jP_k

4. R_2S_2 = P_kQ_k

The pairs (i,j) and (h,k) are called compatible, written
(i,j)C(h,k), if a_h is in w(i,j,k) and a_i is in w(h,k,j).

The iteration formula is: For every (i,j), p^{t+1}(i,j) = 1,
where t denotes the t-th iteration, if p^t(i,j) = 1 and there
exists a subset S of M such that for every l_s in S there exists
an a_k in N such that p^t(k,s) = 1 and (i,j)C(k,s). The
algorithm stops when p^{t+1}(i,j) = p^t(i,j) for all (i,j).
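
A sequential sketch of this discrete relaxation loop is given below. Several details are assumptions of the sketch rather than statements of the paper: the surviving assignments are kept as a set of (object, label) pairs with p = 1, the requirement on the subset S is modelled as "at least `min_support` distinct supporting labels", a pair is not allowed to support itself, and `compatible` is a caller-supplied stand-in for the window test (i,j)C(k,s).

```python
def relax(pairs, compatible, min_support=1):
    """Iterate the discrete relaxation until no assignment changes.

    pairs: iterable of (object index, label index) with p = 1 initially.
    compatible(a, b): True when pair a is compatible with pair b
        (stands in for the window test; supplied by the caller).
    min_support: how many distinct labels must support a pair for it
        to survive an iteration (an assumption of this sketch).
    """
    current = set(pairs)
    while True:
        survivors = set()
        for (i, j) in current:
            # Labels s for which some object k still holds p(k, s) = 1
            # and (i, j) is compatible with (k, s).
            supporting_labels = {
                s for (k, s) in current
                if (k, s) != (i, j) and compatible((i, j), (k, s))
            }
            if len(supporting_labels) >= min_support:
                survivors.add((i, j))
        if survivors == current:        # p^{t+1} = p^t everywhere: stop
            return current
        current = survivors
```

Because assignments are only ever removed, the loop terminates after at most as many iterations as there are initial pairs.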

3.1  Partitioning  Strategy 

The proposed architectures have k processing elements (PEs),
k < m,n. The partitioning strategy is as follows: Initially,
when the entire input is loaded into the main memory,
detect the vector with the minimum length. Let u denote the
length of this vector. Divide the entire image into squares of
size k^{1/2}u x k^{1/2}u. In any of these squares there can be at
most k vectors. A list of all vectors crossing each region
is created. Adjacent squares are merged as long as the
total number of vectors in these squares is at most k. During the
merge operation, regions are merged so that the number
of vectors shared by adjacent regions is minimized. As will
be shown in our implementation, this partitioning strategy
provides linear speedup. This is due to the fact that all the
k processors of the architecture can be kept busy and the
interprocessor communication can be restricted to fetched
regions. An example of partitioning an image into regions
is shown in figure 3.1.
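
The first part of the partitioning strategy (choosing the square size from the shortest vector and listing the vectors that touch each square) can be sketched as follows. The function names and the segment representation are hypothetical, and the sketch assigns a segment to every square overlapped by its bounding box, which is a conservative over-approximation of "crossing". The subsequent merging of adjacent squares is not shown.

```python
import math
from collections import defaultdict


def segment_length(seg):
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)


def bin_segments(segments, k):
    """Divide the image into squares of side sqrt(k) * u and list, for each
    square, the indices of the segments whose bounding box overlaps it.

    segments: list of ((x1, y1), (x2, y2)) endpoint pairs.
    k: number of processing elements.
    Returns (side, {(col, row): [segment indices]}).
    """
    u = min(segment_length(s) for s in segments)   # shortest vector
    if u == 0:
        raise ValueError("zero-length segment")
    side = math.sqrt(k) * u
    squares = defaultdict(list)
    for idx, ((x1, y1), (x2, y2)) in enumerate(segments):
        cx0, cx1 = int(min(x1, x2) // side), int(max(x1, x2) // side)
        cy0, cy1 = int(min(y1, y2) // side), int(max(y1, y2) // side)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                squares[(cx, cy)].append(idx)
    return side, dict(squares)
```

A segment that spans a square boundary appears in more than one list, which is the source of the additional data records discussed next in the text.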

Notice that the above strategy can lead to an increased
amount of data to be processed. If the image is sparse this
problem does not exist: although two consecutive square
regions initially contain two versions of the same segment,
after the merging operation these regions will be merged
into one, so finally the segment will be located in a
constant number of square regions. In the
case of a dense image (m ≈ m_max, where m_max is the maximum
possible number of segments in the image) there can
be (2/3)k^{1/2} segments crossing the boundary of two square
regions so that the two regions will not be merged. Therefore,
we add at most (2/3)(m^{1/2} x (m^{1/2}/k^{1/2})) data records
to the input data set. In the case of segments of arbitrary
length r, the maximum number of all such segments in the
image cannot exceed (2/3)m_max, else merging of regions
will not take place. Thus, we need (2/3)m/k^{1/2} additional
records in the worst case [48].

The image partitioned into regions for k=8

The  partitioning  strategy  for  stereo  matching  is  similar 
to  the  above  technique  [48]. 

3.2  Parallel  Implementations 

The implementation of the image matching algorithm with
respect to the partitioning strategy developed above is
based on the following idea: Assume we have a mesh connected
computer with k processors and each processor has
O(log m + k) bits of memory. During an iteration we examine
a pair (i,j). A set of k objects belonging to a region and
the labels of the corresponding region are fetched onto the
mesh. The p(x,y)'s are also fetched, so that the processor
having a_z will also have p(x,y_z), 1 ≤ z ≤ k. For each label
b, PE(0,0) constructs a record with the label b, the
dimensions of the window w(i,j,b) and a flag. Initially the
flag is reset. PE(0,0) propagates this record to all the PEs
in the mesh. Each PE containing a segment x examines
the relation (i,j)C(x,b) as well as p(x,b). If (i,j)C(x,b) is
true and p(x,b) = 1 and the flag is reset, it sets the flag. If
a PE receives more than one record, it performs an OR
operation between the two flags and considers the outcome
of this OR operation as the new value of the flag. If
PE(k^{1/2} - 1, k^{1/2} - 1) receives a record with the flag set, it
increases the value of q(i,j) by 1.


The propagation of the record from PE(0,0) to
PE(k^{1/2} - 1, k^{1/2} - 1) is done such that PE(y,z) sends the
record to PE(y + 1, z) and to PE(y, z + 1). The PEs along
this line, which is parallel to the diagonal PE(k^{1/2} - 1, 0)-
PE(0, k^{1/2} - 1), read from their memory the p^t(x,b)'s. When
a w(i,j,b) extends further than the boundary of the region,
we add the record with b to a linked list pointed to by the
records of the adjacent regions (one linked list per region). At
the end of each iteration the initial pointer of the linked list
is updated to nil.

The record with the label b has its flag set to true
when PE(k^{1/2} - 1, k^{1/2} - 1) receives the record with
the window w(i,j,b) and its flag is set.

At the end of each iteration we examine q and decide
about the assignment (i,j).

By implementing the above strategy and pipelining the
computations, a speedup of O(k) can be obtained. Details
of the ideas presented here can be found in [48].
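
The way a record's flag travels from PE(0,0) to the opposite corner can be simulated directly. In this sketch `local_hit[y][z]` is a hypothetical input standing for the local test at PE(y,z), namely whether (i,j)C(x,b) holds and p(x,b) = 1 for the segment x stored there; the nested loops replace the diagonal wavefront of the real mesh.

```python
def propagate_flag(local_hit):
    """Simulate one record moving across a side x side mesh.

    Each PE(y, z) ORs the flags arriving from PE(y-1, z) and PE(y, z-1)
    with its own local test and forwards the result to PE(y+1, z) and
    PE(y, z+1). Returns the flag as seen by the last PE.
    """
    side = len(local_hit)
    flag = [[False] * side for _ in range(side)]
    for y in range(side):
        for z in range(side):
            from_above = y > 0 and flag[y - 1][z]
            from_left = z > 0 and flag[y][z - 1]
            flag[y][z] = bool(from_above or from_left or local_hit[y][z])
    return flag[side - 1][side - 1]
```

Since every PE can reach the last corner by moving only down and right, a single successful local test anywhere in the mesh is enough for the corner PE to see the flag set and increment q(i,j).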

Our partitioning strategy also leads to a systolic
implementation. The array consists of k + k^{1/2} cells (PEs). The
PEs are arranged in a 2-dimensional grid of size (k^{1/2} +
1) x k^{1/2}. Each PE has three ports A, B, C. A and B
are the input lines and C is the output line. The B
inputs of the PE(1,x), 0 ≤ x ≤ k^{1/2} - 1, and the A inputs
of the PE(y,1), 1 ≤ y ≤ k^{1/2}, are the inputs to the
entire array. The PEs of the array use the C connection to
propagate the results of the arithmetic/logic operations.
The C inputs of the (k^{1/2})th row are also used as
control inputs. Each PE(p,q) receives, in its C input, the
result of the operation which takes place in the PE(p + 1, q),
1 ≤ p < k^{1/2}, 0 ≤ q ≤ k^{1/2} - 1. The difference is in
the PEs of the first row (PE(0,z), 0 ≤ z ≤ k^{1/2} - 1).
There, each PE(0,z) sends the result of its operation to
the PE(0, z + 1). The PE(0,z) receives in its B input the
result of the C output of the PE(1,z). The A inputs of
a PE(0,w) (0 ≤ w ≤ k^{1/2} - 2) have a 0 as input. The A
input of the PE(0, k^{1/2} - 1) is connected to the C output
of the same PE.

The outline of an iteration is as follows: We fetch to
the k^{1/2} C inputs the object a_i and the label l_j. Then k^{1/2}
labels are fetched to the B inputs. We arrange these data
items to be as in a transposed diagonal array of size k^{1/2} x
k^{1/2}. k^{1/2} objects are fetched to the A inputs, arranged in
a diagonal array of size k^{1/2} x k^{1/2}. This input sequence is
repeated k^{1/2} times. Between two consecutive repetitions
of this input sequence, we fetch k^{1/2} 0's to the A inputs.
Following the labels array, there is an array of size k^{1/2} x
k^{1/2} which contains the p(x,y) relations, where a_x belongs
to the set of the processed objects and l_y to the set of
labels that are processed during this iteration. Between
two consecutive rows of p(i,j)'s that we fetch to the A
inputs, we fetch the label-set input. At the end of the
iteration, at the output of the PE(0, k^{1/2} - 1) we have q(i,j).

The systolic array for k^{1/2} = 3 is shown in figure 3.2.
Speedup of O(k^{1/2}) can be achieved in the above systolic
implementation. Details and proof of correctness of
the above, as well as notes on the systolic implementation
of the stereo matching, can be found in [48].

Figure 5: The systolic array for k^{1/2} = 3
Parallel  Hardware  for  Constraint  Satisfaction

Abstract

A  parallel  implementation  of  constraint  satisfaction  by  arc 
consistency  is  presented.  The  implementation  is  constructed 
of  standard  digital  hardware  elements,  used  in  a  very  fine¬ 
grained,  massively  parallel  style.  As  an  example  of  how  to 
specialize  the  design,  a  parallel  implementation  for  solving 
graph  isomorphism  with  arc  consistency  is  also  given. 

Complexity analyses are given for both circuits. Running
time for the algorithms turns out to be linear or O(n^2), and
if the I/O must be serial, it will dominate the computation
time. Fine-grained parallelism trades off time complexity for
space complexity, but the number of gates required is only
O(n^4).


1  Introduction 

Constraint  satisfaction  is  an  important  technique  used 
in  the  solution  of  many  artificial  intelligence  problems. 
Since  the  original  applications  such  as  Waltz  filtering 
[Waltz 1975], an essential aspect of most constraint satisfaction
algorithms has been their cooperative or parallel
nature (e.g., [Davis and Rosenfeld 1981]). While the parallel
spreading activation nature of constraint propagation
has  been  adopted  whole-heartedly  in  specific  applications 
such  as  connectionist  relaxation  [Feldman  and  Ballard  1982] 
[Hinton  et  al.  1984],  some  of  the  most  complete  and  gen¬ 
erally  useful  formal  results  analyze  sequential  algorithms 
[Mackworth  and  Freuder  1985].  Generating  a  formal  anal¬ 
ysis  of  one  recent  connectionist  implementation  of  discrete 
relaxation  [Cooper  1988]  inspired  us  to  design  a  massively 
parallel  implementation  of  the  classic,  more  generally  appli¬ 
cable  arc  consistency  constraint  satisfaction  algorithm,  as  de¬ 
scribed  by  Mackworth  [1977]  and  Hummel  and  Zucker  [1983]. 
The  implementation  is  constructed  of  standard  digital  hard¬ 
ware  elements,  used  in  a  very  fine-grained,  massively  parallel 
style.  The  resulting  circuit  is  thus  an  obvious  candidate  for 
fabrication  in  VLSI,  and  is  thus  similar  to  the  work  of  Mead 
[1983], 

The paper also provides a parallel hardware implementation
of the arc consistency algorithm for a specific application:
labelled graph matching. Such matching by constraint
propagation and relaxation is often used in visual recognition
systems [Cooper 1988] [Kitchen and Rosenfeld 1979].

Complexity analyses are given for both circuits. Running
time for the algorithms turns out to be linear or O(n^2), and
if the I/O must be serial, it will dominate the computation
time. Fine-grained parallelism trades off time complexity for
space complexity, but the number of gates required is only
O(n^4).

2  Constraint  Satisfaction 

In this section, we review constraint satisfaction as formulated
by Mackworth [1977] [1985] and Hummel and Zucker
[1983]. A constraint satisfaction problem (CSP) is defined as
follows: Given a set of n variables each with an associated
domain and a set of constraining relations each involving a
subset of the variables, find all possible n-tuples such that
each n-tuple is an instantiation of the n variables satisfying
the relations. We consider only those CSPs in which the domains
are discrete, finite sets and the relations are unary and
binary.

A k-consistency algorithm removes all inconsistencies involving
all subsets of size k of the n variables. In particular,
node and arc consistency algorithms detect and eliminate
inconsistencies involving k = 1 and 2 variables, respectively.

More specifically, a typical arc consistency problem consists
of a set of variables, a set of possible labels for the variables,
a unary predicate, and a binary predicate with an associated
constraint graph. For each i of the n variables, the unary
predicate Pi(x) defines the list of allowable labels x taken
from the domain of the variables. For each pair of variables
(i,j) in the constraint graph the binary predicate Qij(x,y)
defines the list of allowable label pairs (x,y). To compute
the n-tuples which satisfy the overall problem requires that
the local constraints be propagated among the variables and
arcs.

Mackworth [1977] specifies one such constraint satisfaction
algorithm for arc consistency: AC-3. In
[Mackworth and Freuder 1985] it is shown that the complexity
of AC-3 is O(a^3 e), where a is the number of labels (or the
cardinality of the domain), and e is the number of edges in
the constraint graph associated with Qij(x,y).

Hummel and Zucker describe a parallel version of the arc
consistency algorithm as follows (using Mackworth's notation).

Arc consistency is accomplished by means of the label discarding
rule: discard a label x at a node i if there exists a
neighbor j of i such that every label y currently assigned to j
is incompatible with x at i, that is, ¬Qij(x,y) for all y ∈ Dj.
The label discarding rule is applied in parallel at each node,
until limiting label sets are obtained.
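
As an illustration only, the label discarding rule can be simulated in
software. The following Python sketch is not part of the hardware design;
the function names, the dictionary representation of the label sets Dj and
the predicate signature q(i, j, x, y) are assumptions made for this
example. Each call to discard_step applies the rule at every node using
the label sets of the previous step, which mimics the parallel update.

```python
def discard_step(domains, q, neighbours):
    """Apply the label discarding rule once, at every node simultaneously.

    domains:    dict node -> set of labels currently assigned to that node
    q:          hypothetical predicate q(i, j, x, y), True if (x, y) is allowed
    neighbours: dict node -> list of neighbouring nodes in the constraint graph
    """
    return {
        i: {x for x in labels
            # keep x only if every neighbour j still has a compatible label y
            if all(any(q(i, j, x, y) for y in domains[j])
                   for j in neighbours[i])}
        for i, labels in domains.items()
    }


def parallel_arc_consistency(domains, q, neighbours):
    """Repeat the synchronous step until the limiting label sets are reached."""
    while True:
        new_domains = discard_step(domains, q, neighbours)
        if new_domains == domains:
            return new_domains
        domains = new_domains
```

Because every node reads only the previous label sets, the order in which
nodes are visited inside one step does not matter.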

3  A  Hardware  Implementation 

The Arc Consistency (AC) chip consists of two arrays of JK
flip-flops and suitable amounts of combinational circuitry.
The most important part of the design is the representation
for the two constraint tables Pi(x) and Qij(x,y). In the
massively parallel connectionist design style, we adopt the
unit/value principle, and assign one memory element to represent
every possible value of Pi(x) and Qij(x,y). (As will be
seen, JK flip-flops are used as the memory elements because
of their convenient reset characteristics.) For the hardware to
be universal on any arc consistency problem, the two arrays
must be able to represent any given Pi(x) and Qij(x,y) of
sizes bounded by n and a.

The first (node) array consists of a·n flip-flops we call u(i,x),
which are initialized to Pi(x). That is, if x is a valid label at
node i, then the flip-flop u(i,x) is initialized to on. Thus
initially at least, the flip-flops which are on all correspond to
labellings of a node which are valid considering only the local
(unary) constraint at that node. Note that all flip-flops are
initialized. The ultimate answer to the computation (which
labels are arc consistent at each node) will be contained in
this array at the end of the computation.

The second (arc) array consists of a^2 n(n - 1) flip-flops we
designate v(i,j,x,y), which are initialized to conform to the
arc constraint table Qij(x,y). Note that the table Qij(x,y)
can designate three things. If Qij(x,y) = 1, then the arc (i,j)
is present in the constraint graph and the label pair (x,y) is
a valid labelling of the pair. If Qij(x,y) = 0, the arc (i,j) is
again present in the constraint graph, but the label pair (x,y)
is not allowed on that arc. But Qij(x,y) might also just
not be present in the arc constraint table, which indicates
that there is no consistency constraint between nodes i and
j. To account for the fact that Qij(x,y) might be incomplete,
v(i,j,x,y) is initialized as follows:

    v(i,j,x,y) = Qij(x,y)   if i is adjacent to j in the constraint graph
    v(i,j,x,y) = 1          otherwise

Note that the arc array is static; it does not change throughout
the computation.
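
The initialization of the arc array can be sketched in a few lines of
Python. This is only a software illustration under stated assumptions: the
function name init_arc_array, the use of a dictionary keyed by (i, j, x, y)
and the predicate signature q(i, j, x, y) are choices made for the example
and do not come from the circuit itself.

```python
def init_arc_array(n, a, edges, q):
    """Build the static arc array v(i, j, x, y) for n nodes and a labels.

    edges: set of ordered pairs (i, j) present in the constraint graph
    q:     hypothetical predicate q(i, j, x, y) giving the table entry Qij(x, y)
    """
    v = {}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                      # no flip-flops for i == j
            for x in range(a):
                for y in range(a):
                    if (i, j) in edges:
                        v[(i, j, x, y)] = int(q(i, j, x, y))
                    else:
                        v[(i, j, x, y)] = 1   # no arc: nothing is forbidden
    return v
```

The dictionary holds exactly a^2 n(n - 1) entries, one per flip-flop of the
arc array.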

The basic structure of the two arrays of flip-flops is shown
in Figure 1.

Figure 1: Unary and Binary Constraint Tables


It remains only to develop combinational circuitry which
implements the label discarding rule, i.e. that causes the
flip-flop representing the label x at node i to be reset to zero if
it becomes inconsistent. The combinational circuitry is thus
designed so that the K (reset) input of the JK flip-flop u(i,x)
receives the value:

    reset(u(i,x)) = \neg \bigwedge_{j=1, j \neq i}^{n} \bigvee_{y=1}^{a} ( u(j,y) \wedge v(i,j,x,y) )

The J input of each JK flip-flop is tied to 0. A partial circuit
diagram for this equation is given in Figure 2. This figure
shows the reset circuitry for one flip-flop in the node table
u(i,x). In the figure, the entire node table is present, but
only the part of the arc table v(i,j,x,y) useful for this node
is drawn. An analogous circuit for each node completes the
whole circuit.

To interpret the equation and circuit, consider first the
inner term u(j,y) ∧ v(i,j,x,y) for a particular case of u(i,x).
The fact that v(i,j,x,y) is true tells us that (x,y) is an
allowed label pair for nodes i and j: either there is an arc
between i and j and (x,y) is consistent for it, or there is no
arc and hence no constraint. We already know that u(i,x) is
true; anding with u(j,y) checks that the other end of the arc
has a valid label. Point A on the circuit diagram in Figure 2
shows where this term is computed.

At this point, as far as node i is concerned, x is a label
consistent with neighbouring node j's label y. The disjunction
\bigvee_{y=1}^{a} simply ensures that at least one label y on
neighbouring node j is consistent. This function has been
computed after the or gate at point B in Figure 2.

Label x on node i is thus consistent with its neighbour
j. But what about node i's other neighbours? The conjunction
\bigwedge_{j=1, j \neq i}^{n} ensures that there is arc consistency
among all of node i's neighbours. The and gate at C in Figure 2
ensures this.

If  the  signal  is  on  at  point  C,  that  means  that  label  x  is 
consistent  for  node  i  -  therefore,  the  flip-flop  need  not  be 
reset.  Thus  the  not  gate. 

To  reverse  the  analysis,  if  some  node  j  does  not  have  a 
consistent  labelling,  then  at  point  B,  the  signal  will  be  off. 
The  and  will  fail,  so  the  signal  at  C  will  also  be  0,  and  then 
the  not  gate  will  cause  flip-flop  u(i,x)  to  be  reset. 
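
The behaviour of the reset circuitry can be checked with a small software
simulation. The sketch below is an illustration under assumptions, not the
circuit itself: the node array u is taken to be a list of lists of
booleans, the arc array v a dictionary keyed by (i, j, x, y), and the
function names are hypothetical. Comments mark the points A, B and C
discussed above.

```python
def reset_signal(u, v, i, x):
    """K input of flip-flop u(i, x), following the reset equation."""
    n, a = len(u), len(u[0])
    point_c = all(                               # and gate (point C)
        any(u[j][y] and v[(i, j, x, y)]          # 2-input and gates (point A)
            for y in range(a))                   # or gate (point B)
        for j in range(n) if j != i
    )
    return not point_c                           # not gate


def clock(u, v):
    """One clock tick. With J tied to 0, a flip-flop holds its value when
    K is 0 and is reset when K is 1, so it can never be switched back on."""
    n, a = len(u), len(u[0])
    return [[bool(u[i][x]) and not reset_signal(u, v, i, x)
             for x in range(a)] for i in range(n)]


def run(u, v):
    """Clock the node array until no flip-flop changes."""
    while True:
        nxt = clock(u, v)
        if nxt == u:
            return nxt
        u = nxt
```

For two nodes that must carry different labels, where node 1 admits only
label 0, the simulation resets u(0,0) and leaves the remaining on flip-flops
untouched.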


3.1  Correctness 


To begin with, recall that we are interested in discarding
labels, an operation which corresponds to resetting on
flip-flops to 0. Furthermore, since the J input of each JK flip-flop
in the node array is tied to zero, the flip-flops can only ever
be reset to 0, never set. Once they are off they must stay
off, so the whole process is clearly monotonic. Therefore, all
we need to show for correctness is that the network
correctly applies the label discarding rule. If the network
discards labels when they should be discarded, and does not
discard them when they should be kept, then it implements
the label discarding rule correctly.

The label discarding rule can be formally expressed as follows:

    \exists j (j \neq i) \; \forall y \; [ u(j,y) \wedge v(i,j,x,y) = 0 ]

But this expression is equivalent to

    \bigwedge_{j=1, j \neq i}^{n} \bigvee_{y=1}^{a} ( u(j,y) \wedge v(i,j,x,y) ) = 0

or

    \neg \bigwedge_{j=1, j \neq i}^{n} \bigvee_{y=1}^{a} ( u(j,y) \wedge v(i,j,x,y) ) = 1

which is just the condition under which u(i,x) is reset. Therefore,
the network correctly discards labels when it should.
The converse follows from negating the above equations.

3.2  Complexity 

The circuit requires a·n JK flip-flops for the node array, and
a^2 n(n - 1) flip-flops for the arc array. From Figure 2, we see
that there is an and gate for every flip-flop in the arc array,
so a^2 n(n - 1) 2-input and gates are required for this purpose.
For each of the a·n flip-flops in the node array there are n - 1 or
gates required, each taking a inputs: a total of a·n(n - 1) or
gates. Finally, there are a·n and and not gates (nand gates),
each taking n - 1 inputs. There are also O(a^2 n^2) wires.
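
The component counts above are easy to tabulate for concrete problem
sizes. The following Python helper is only a convenience sketch; its name
and the dictionary keys are assumptions introduced here, while the
formulas are the ones stated in the text.

```python
def general_circuit_size(n, a):
    """Component counts of the general arc consistency circuit
    for n variables and a labels, as given in the text."""
    return {
        "node_flip_flops": a * n,
        "arc_flip_flops": a * a * n * (n - 1),
        "and2_gates": a * a * n * (n - 1),   # one per arc flip-flop
        "or_gates": a * n * (n - 1),         # each with a inputs
        "nand_gates": a * n,                 # each with n - 1 inputs
    }
```

With a = n the arc array term becomes n^3 (n - 1), which is the source of
the O(n^4) space bound quoted in the abstract.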

The worst case time complexity of the network occurs when
only one JK flip-flop is free to reset at a time. So if propagation
through the and and or gates is considered instantaneous,
the worst case time complexity is a·n. If a logarithmic
time cost is assigned to the large fan-in and and or gates the
worst case time complexity is O(a log(a) n log(n)).

Note that if the node and arc arrays must be initialized
serially, loading them takes more time (O(a^2 n^2) steps) than
executing the algorithm. For almost all applications of constraint
satisfaction the binary predicate Qij(x,y) can be specified
with less than O(a^2 n^2) information, and so instead of the
arc array a circuit could be built that supplies the correct
values to the and gates without needing so many memory
elements to fill. An application in which this is true is graph
matching, which we describe in the next section.


4  Graph  Matching 

Graph matching can be defined as a constraint satisfaction
problem. General graph matching requires k-consistency
(and is NP-complete, in fact). With just arc consistency, a
restricted yet still interesting class of graphs may be matched.
Furthermore, the effectiveness of matching graphs by constraint
satisfaction with only arc consistency can be enhanced
if the graphs are labelled. This kind of restricted matching of
labelled graphs is particularly suited to the visual indexing
problem [Cooper 1988]. In this problem, labelled graphs are
used to represent structurally composed objects. The constraint
satisfaction process is used only to filter recognition
candidates, and the few graphs not discriminable with the
limited power of arc consistency can be addressed in other
ways.

If labelled graph matching is framed as a constraint satisfaction
process, the unary constraint is that the labels on corresponding
vertices be the same. The binary (arc) constraint
ensures that the connectedness between pairs of corresponding
vertices is the same. In other words, if there is an edge
between two vertices in one graph, there must be an edge
between the corresponding vertices in the other graph. In this
section, we describe without loss of generality the matching
of undirected graphs.

So, for the graph matching problem:

    Pi(x) = (label(i) = label(x))

and

    Qij(x,y) = (adjacent(i,j) = adjacent(x,y))

For the graph matching problem the number of possible labels
equals the number of vertices, so a = n.
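
For concreteness, the two predicates can be written directly from the
definitions. The Python sketch below assumes that each graph is given as a
list of vertex labels plus a symmetric 0/1 adjacency matrix; this
representation and the function name are assumptions of the example, not
something prescribed by the design.

```python
def graph_matching_predicates(labels_a, adj_a, labels_b, adj_b):
    """Return (p, q) for matching graph A (vertices i, j) against
    graph B (vertices x, y, which play the role of labels)."""

    def p(i, x):
        # Unary constraint: corresponding vertices carry the same label.
        return labels_a[i] == labels_b[x]

    def q(i, j, x, y):
        # Binary constraint: adjacency must agree in both graphs.
        return adj_a[i][j] == adj_b[x][y]

    return p, q
```

The returned p can be used to initialize the node array and q to supply
the arc constraints.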

There are some modifications we can make to the general
arc consistency circuit that are to our advantage for this particular
application.

Constraint Table Computation by Special-Purpose Circuitry

One modification is to replace the arc array by a circuit designed
as follows. Construct two arrays of

    \binom{n}{2} = \frac{n(n-1)}{2}

flip-flops representing adjacent(i,j) and adjacent(x,y) respectively.
Note that these are adjacencies in the input
graphs, not in the constraint graph. For all possible
(i,j), (x,y) pairs, wire one flip-flop from the (i,j) array
and one flip-flop from the (x,y) array to a gate computing
the equality function (the complement of xor). Then the output
of the ((i,j),(x,y))'th gate represents Qij(x,y). The network
will then have only O(n^2) flip-flops to load prior to the
computation.
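
A software analogue of this arrangement stores one bit per unordered
vertex pair and computes each Qij(x,y) on demand with an equality gate. The
sketch below is hedged accordingly: the names adjacency_bits and q_gate and
the use of frozenset keys are assumptions made for illustration.

```python
from itertools import combinations


def adjacency_bits(n, edges):
    """One bit per unordered vertex pair: n(n-1)/2 'flip-flops'."""
    edge_set = {frozenset(e) for e in edges}
    return {frozenset(pair): frozenset(pair) in edge_set
            for pair in combinations(range(n), 2)}


def q_gate(bits_a, bits_b, i, j, x, y):
    """Equality gate (complement of xor) for i != j and x != y."""
    return not (bits_a[frozenset((i, j))] ^ bits_b[frozenset((x, y))])
```

Only the two small adjacency arrays have to be loaded; every constraint
value is derived from them combinationally.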

Analogous special purpose circuitry to compute Pi(x) from
the vertex/label sets of each graph can easily be imagined as
well. In that case, the equality gate must check equality of the
labels, so it is likely comparing more than just a single bit.

In any case, it is clear that actually computing the constraint
tables Pi(x) and Qij(x,y) may be a significant part
of the overall computation. In many specialized cases, it is
clearly possible to actually build parallel circuitry to assist
in computing the constraint tables, rather than serially computing
the predicates beforehand and then loading them into
the parallel hardware.


Graph matching need not be simply isomorphism, as many
vision applications emphasize [Shapiro and Haralick 1981]. If
we restrict ourselves to pure isomorphism, however, the graph
matching problem is symmetric. In terms of the constraint
satisfaction formulation, the symmetry means that the vertices
of graph A have to be possible labels for graph B as well
as vice versa. Therefore, for a flip-flop (i,x) to stay on, one may
require it to be consistent regarding x as the label and i as the
vertex, and vice versa. So in addition to the and-or network
described for the general constraint satisfaction problem the
graph matching circuit has a complementary network in the
opposite direction. The two circuits are anded together before
the inverter at the K input of the JK latch. Together
these circuits compute

    \neg \left( \left( \bigwedge_{j=1, j \neq i}^{n} \bigvee_{y=1}^{n} ( u(j,y) \wedge Q_{ij}(x,y) ) \right) \wedge \left( \bigwedge_{y=1, y \neq x}^{n} \bigvee_{j=1}^{n} ( u(j,y) \wedge Q_{ij}(x,y) ) \right) \right)

The circuit which implements this equation finds all possible
labelings that are pairwise consistent both for matching
graph A to graph B and for matching graph B to graph A.
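
The two halves of the symmetric reset condition can be evaluated
separately in software. The Python sketch below follows the equation above
under stated assumptions: u is a list of lists of booleans with a = n, q is
a hypothetical predicate q(i, j, x, y) standing for Qij(x,y), and the
function names are introduced only for this example.

```python
def forward_ok(u, q, i, x):
    """Every other vertex j of graph A has some compatible label y."""
    n = len(u)
    return all(any(u[j][y] and q(i, j, x, y) for y in range(n))
               for j in range(n) if j != i)


def backward_ok(u, q, i, x):
    """Every other vertex y of graph B is a compatible label of some j."""
    n = len(u)
    return all(any(u[j][y] and q(i, j, x, y) for j in range(n))
               for y in range(n) if y != x)


def symmetric_reset(u, q, i, x):
    """K input of flip-flop (i, x) in the graph matching circuit."""
    return not (forward_ok(u, q, i, x) and backward_ok(u, q, i, x))
```

The backward half is what rejects a labelling in which some vertex of
graph B can no longer be matched by any vertex of graph A.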

4.1  Complexity 

If no special purpose circuitry is used to compute Pi(x) and
it is input as a table of a·n, or n^2, entries (in this case, a = n),
then the complexity is as follows. The node array requires
n^2 JK flip-flops. The reduced arrays representing the input
graphs require a total of n(n - 1) flip-flops. To replace the arc
array, there are n^2 (n - 1)^2 xor gates. Analogous to the earlier
design, n(n - 1) 2-input and gates are required, n^2 (n - 1) or
gates, and n^2 nand gates. There are O(n^4) wires, as for the
general constraint satisfaction network.

The worst case time complexity for the graph matching
network is the same as for the constraint satisfaction network,
O(n^2) ignoring propagation time and O(n^2 log^2 n) taking
it into account. Loading and unloading the network takes
O(n^2) sequential time, and so does not affect the worst-case
performance of the network. Since the expected time of the
constraint satisfaction step could be much less than the
worst-case performance, sequential loading and unloading is still
likely to be the performance bottleneck.

4.2 Comparison with Connectionist Network

Cooper [1988] gives a connectionist network design for solving
the same labelled graph matching problem addressed here.
Interestingly, although the two networks were developed from
completely different heritages, and for different reasons, they
are remarkably alike. In particular, the central aspect of both
designs, the representation of the unary and binary constraint
predicates as completely filled-in tables, is exactly
the same. This reflects the adoption of the unit/value design
principle, which is useful for obtaining a very high degree of
parallelism, no matter what the primitive units of computation.

Unlike  the  digital  network,  the  connectionist  network  is 
never  intended  to  interface  with  sequential  processes,  and 
the  input  constraint  tables  are  filled  by  parallel  spreading 
activation.  As  a  result,  the  I/O  bottleneck  does  not  occur. 
Of  course,  if  the  digital  network  receives  parallel  input,  the 
same  is  true. 

5  Discussion  and  Conclusions 

The  utility  of  constraint  satisfaction  methods  in  the  solution 
of  many  AI  problems  suggests  that  efficient  implementations 
might  be  widely  useful.  Furthermore,  constraint  satisfaction 
methods  have  an  obvious  parallel  character. 

In this paper, we have given a massively parallel design
which provably implements one classic constraint satisfaction
algorithm. Our implementation thus inherits the correctness
characteristics of the original formulation. We have
also shown how this design is easily specializable for particular
problems. This specialization process provides a desirable
alternative to designing and proving a new parallel network
for each particular problem.

As might be expected, the highly parallel implementation
runs very fast. Although the technical worst-case running time is
linear in the number of variables, it is much more reasonable
to expect that the network runs in a small constant number
of time steps. Overall, if I/O time is not included, the performance
of the network can be expected to be much better
than that of the best sequential implementations.

It would be straightforward to construct our arc consistency
chip, even for the general case. If, however, the parallel
machine is forced to interface with sequential processes, the
run-time complexity becomes similar to that expected from
standard sequential implementations of arc consistency. The
I/O bottleneck can be overcome by supplying parallel input
or by specializing the chip to solve a particular problem, as
we showed in the graph matching example.

Abstract

This  paper  describes  the  motivation  for  and  progress 
toward  the  implementation  and  use  of  a  special-purpose  VLSI 
processor  array  for  computer  vision.  A  Scan  Line  Array 
Processor  (SLAP)  includes  one  processing  element  (PE)  for 
each  pixel  on  a  single  image  scan  line,  arranged  in  a  linear 
array.  Each  PE  has  internal  fixed-point  functional  units  and  a 
register  bank.  A  SLAP  operates  in  SIMD  fashion,  with  a 
system  controller  broadcasting  a  single  instruction  to  all  PEs 
in  each  cycle.  Adjacent  processing  elements  can  exchange 
data in each cycle, and a dedicated video shift register performs
concurrent image I/O.

The simple control structure and linear topology of a SLAP
lend themselves to cheap, fast implementations and to easy
scaling  with  technology  improvements.  Given  this  promise  of 
great  cost  effectiveness,  a  key  issue  is  the  efficiency  with 
which  problems  can  be  mapped  onto  the  array.  We  have 
designed  a  number  of  algorithms  and  algorithm  mapping 
schemes  that  demonstrate  efficient  mappings  for  a  broad 
range  of  important  image  operations,  both  local  and  global. 

In  order  to  fully  assess  the  practical  value  of  the  SLAP 
approach,  we  are  constructing  a  prototype  array  with  512 
processing  elements.  Built  around  custom  2  micron  CMOS 
chips,  the  array  and  its  controller  will  occupy  three  standard- 
size  circuit  cards  and  deliver  some  4  billion  16  bit  integer 
operations per second. Performance estimates on some commonly
used algorithms indicate that this raw throughput can
be put to good use. In addition to developing hardware and
algorithms, we are also building low-level programming tools
and an optimizing compiler for a high-level parallel language
tailored  for  image  processing.  This  paper  gives  an  overview 
of  the  SLAP  concept  and  the  current  implementation  work, 
and  provides  some  comparative  performance  predictions. 

1.  Introduction 

Computer perception, particularly vision, demands extremely
high computation rates from units that must be compact
and inexpensive. Many parallel architectures for image
processing and low-level vision have been constructed, and
many more have been proposed. These architectures usually
fall into one of three classes: those that attempt to operate on
an entire image in parallel, like MPP (Batcher 1980, Potter
1985), CLIP4 (Duff 1976, Duff and Fountain 1986), DAP
(Flanders et al 1977), and the Connection Machine (Hillis
1985), all of which operate as two-dimensional array processors;
those that scan an image and operate on a small neighborhood
window at each step, as do many industrial vision
systems, the Cytocomputer (Lougheed and McCubbrey 1980),
and PIPE (Kent et al 1985); and those with a structure not
directly related to image structures or image computations,
including systolic machines like Warp (Kung and Menzilcioglu
1984, Annaratone et al 1987), the NEC dataflow
machine ImPP (Iwashita and Temma 1986), the Hughes HBA
(Wallace and Howard 1987), and PASM (Meyer et al 1985).
Elsewhere (Fisher and Highnam 1985) we have argued that
the grid and small-window architectures have inherent drawbacks
for image computations in terms of scaling for cost and
performance. The more general purpose architectures, on the
other hand, tend to exhibit a higher ratio of cost to performance
than more specialized structures.

In this paper, we detail our initial explorations of an architecture
that takes a position intermediate between the
hypothetical full-image mesh and the neighborhood processors.
The scan line array processor (SLAP) contains one
processor for each pixel on a scan line, in the simplest case.
This allows the architecture to scale gracefully with image
size: the number of processors and the speed of each processor
grow linearly with the perimeter of an image. We are
aware of two other investigations along similar lines, the
CLIP7 (Duff and Fountain 1986, Fountain 1985), which uses
several rows of processors to emulate a full grid, and the
bit-serial PIXIE-5000 (Wilson 1985), a commercial product
of Applied Intelligent Systems.

Using current VLSI technology, it is possible to build, at
reasonable cost, a full scan line of reasonably powerful
processors, avoiding the intricacies of bit-level programming
and spatial image decomposition. Furthermore, the correspondence
between the structure of the machine and the
structure of an image allows most high-bandwidth traffic to
take place independently of a general-purpose host or global
bus, thus avoiding the need for expensive support hardware.
Finally, as we demonstrate in this paper, SLAPs can efficiently
execute a large set of important image computations
and are good targets for current compiler technology.

2.  Scan  Line  Array  Processors 

For current computer vision systems, typical image sizes
are 512 by 512 eight bit pixels. In the future, image sizes are
likely to increase with improvements in sensor technology
and demands for higher precision in perception. However,
the image processing grids of which we are aware are not
capable even now of providing one processor per pixel. In
fact, by sacrificing individual processor power (large grid
machines have all been bit-serial) to permit the massive
duplication of processor sites, the effectiveness of the programming
model that these machines present for image
processing has weakened. In this section we outline a processor
structure that provides a good match with the problem,
and motivate some of the architectural decisions described
later.

When  an  image  is  moved  between  major  components 
within  a  system,  the  transfer  is  of  a  stream  of  pixels.  When 
the  source  of  the  image  is  a  sensor  then  the  stream  comprises 
a sequence of rows. A SLAP operates on each row, or “scan
line,” as it arrives, a distinct processor receiving each image
column. The usual mode of operation for a SLAP is therefore
to “sweep” down an image as shown in Figure 2-1. Each
SLAP processor communicates with its two nearest neighbors,
forming  a  long  linear  array.  Input  and  output  of  image  rows 
is  achieved  by  a  video  rate  shift  register  built  into  the  array. 
This  allows  a  scan  line  to  be  shifted  in  at  the  same  time  as 
earlier  scan  lines  are  being  processed  and  a  result  scan  line  is 
being  shifted  out,  yielding  online  real-time  processing. 
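
The sweep can be pictured with a short software model. The Python sketch
below is an illustration only and makes several assumptions that are not
taken from the SLAP design: the chosen operation (a three-point horizontal
average), the replication of edge pixels at the ends of the array, and the
function names. Each list element stands for the value held by one PE, and
every PE performs the same operation on its own pixel and the values
received from its two neighbors.

```python
def slap_row_op(row):
    """All PEs execute the same instruction on one scan line."""
    left = [row[0]] + row[:-1]     # value received from the left neighbor
    right = row[1:] + [row[-1]]    # value received from the right neighbor
    return [(l + c + r) // 3 for l, c, r in zip(left, row, right)]


def sweep(image):
    """Process the image one scan line at a time, top to bottom."""
    return [slap_row_op(row) for row in image]
```

In the real array the next scan line would be shifted in while the current
one is processed; the sequential loop here models only the order of the
results.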


Figure 2-1: The SLAP vector sweeps down an image

The SLAP processor array uses the Single Instruction
stream, Multiple Data stream (SIMD) control paradigm. In
addition to minimizing overhead costs for instruction
decoders, instruction memory, and control flow logic, SIMD
control simplifies inter-PE communication because instruction
execution is lock-step. This avoids the cost and delay of
buffering and synchronization hardware. A SLAP's only
high-bandwidth communication routes are the nearest-neighbor
connections. Despite this simplicity of control and
communication, we show below that a SLAP can be effectively
programmed to perform a surprisingly wide range of
tasks with high efficiency.


Figure  2-2:  A  SLAP  system 

Compared  with  a  grid  machine,  a  SLAP  has  only  a 
moderate  number  of  processors.  This  fact  can  be  used  to 
advantage  to  invest  hardware  resources  in  more  powerful 
processing  elements  (PEs).  Images  rarely  have  binary  pixels. 
Thus,  the  vast  majority  of  arithmetic  operations  performed 
during  the  processing  of  an  image  are  on  multi-bit  operands, 
generating  multi-bit  results.  Therefore,  the  PE  word  size  can, 
and  should,  be  quite  large.  The  disadvantage  of  a  large  word 
is  that  the  PE  datapath  will  not  be  fully  utilized  by  some 
applications.  On  the  other  hand,  programming  a  machine 
with  a  very  small  native  word  size  requires  the  programmer  to 
be  aware  of  the  maximum  data  sizes  at  each  step  during  the 
computation,  or  simply  assume  and  allocate  space  for  the 
maximum  size  possible  for  each  quantity  from  the  beginning. 
The  other  extreme  is  to  use  a  floating  point  representation. 
This  we  regard  as  being  too  expensive  to  implement  in  the 
context  we  have  examined  and  unnecessary  for  most  image 
computations. We see large integers as a good compromise
between bit-serial and floating point processing for an
image processor.

The Achilles' heel of any SIMD machine is inflexibility.
In order to provide for data-dependent computation, nearly
every such architecture provides for conditional execution,
whether centralized as in ILLIAC-IV (Hord 1982) or handled
locally to each PE. Another important type of flexibility is
provided by local address generation ability in each PE. This
capability is found in ILLIAC and in other large-word SIMD
machines. It is not found in bit-serial machines, for the
simple reason that address generation, storage and communication
would dwarf the rest of the processor. A third
type of flexibility is the provision for single cycle shift and
rotation operations with locally determined distances, crucial
to flexible and efficient bit-field manipulations. All of these
mechanisms are provided in the SLAP processor.

The controller of a SLAP system must be able to receive
feedback from the vector for data reduction operations, associative
operations or calculations with data-dependent run
times. This is achieved in two simple ways: the controller
provides  two  registers  that  the  PEs  at  the  vector  extremes  treat 
as  neighbors,  and  any  PE  can  write  to  a  single  bit 
“some/none”  line  that  the  controller  reads.  These 
mechanisms  are  cheap  to  implement  and  provide  a  channel 
for  low-bandwidth  results  and  a  predicate  evaluation  ability 
for  iterative  operations.  A  controller  is,  of  course,  able  to 
broadcast  literals  within  the  instruction  stream. 
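
The role of the some/none line as a loop predicate can be illustrated with a minimal Python sketch. It is not the SLAP controller's code: the function names `some_none` and `propagate_max`, and the choice of a neighbor-maximum step as the iterated operation, are assumptions made only for illustration.

```python
def some_none(flags):
    """Model of the wired-OR response line: True if any PE raised its bit."""
    return any(flags)


def propagate_max(values):
    """Repeat a neighbor exchange until no PE reports a change.

    On each step every PE takes the maximum of itself and its two
    neighbors. The controller keeps issuing the step while the
    some/none line reads "some", so the run time is data-dependent.
    Returns the final values and the number of steps that changed data.
    """
    values = list(values)
    steps = 0
    while True:
        n = len(values)
        new = [max(values[max(i - 1, 0)], values[i], values[min(i + 1, n - 1)])
               for i in range(n)]
        changed = [a != b for a, b in zip(values, new)]
        values = new
        if not some_none(changed):
            return values, steps
        steps += 1
```

The controller never inspects individual PE values here; one bit of feedback per step is enough to terminate the loop.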

This  basic  organization  is  very  flexible.  Cheaper,  lower- 
performance  machines  may  be  constructed  by  using  a  shorter 
array and simulating many cells with one. For most algorithms,
higher performance can be achieved by running
several  arrays  in  parallel  or  pipelined.  Certain  types  of  image 
operations  can  be  facilitated  by  more  frequent  sampling  of  the 
video  shift  register.  The  machine  can  be  used  either  online 
for  real-time  processing  or  offline.  Real-time  processing  can 
be  achieved  even  for  some  algorithms  where  the  time  to 
process a scan line is data-dependent, by asynchronously buffering
several scan lines as necessary in the cells’ registers.

The  linear  organization  also  offers  great  leverage  in  terms 
of  scaling  with  technology  or  image  size.  Any  number  of 
processors  can  be  packed  onto  a  chip  without  changing  its 
pinout,  and  the  same  applies  to  circuit  boards.  Systems  of  any 
reasonable size can be constructed by modular composition.

The  array  can  be  made  more  powerful  by  the  addition  of  a 
variety  of  architectural  features,  at  varying  costs.  Perhaps  the 
most  obvious  is  to  supply  each  cell  with  enough  memory  to 
hold one or several image columns (as in CLIP7). This may
be  achieved  (at  current  circuit  densities,  with  only  a  few  PEs 
per  chip)  with  external  memory,  or  with  on-chip  memory.  If 
external  memory  is  used,  a  cost/utility  tradeoff  between 
broadcast  and  individual  memory  addresses  exists. 

Depending  on  the  application,  a  SLAP  can  be  used  in  a 
variety of system configurations. The simplest is as a dedicated
image processor/enhancer, where the array is controlled
by  a  code  ROM  and  sequencer,  and  all  I/O  is  digital  video. 
Another  step  up  in  complexity  is  an  image  processing  system 
based  on  a  host  computer.  Here  the  SLAP  is  provided  with 
code  in  a  RAM  that  can  be  written  by  the  host,  and  video  I/O 
is  performed  by  frame  buffers  that  are  also  accessible  to  the 
host.  A  fully  configured  vision  subsystem  might  add  a  more 
complicated  microcoded  array  controller  and  interfaces  to  the 
array’s data ports. A key cost feature of all of these configurations
is that the video data stream, the only necessarily
high-bandwidth path in the system, is implemented with
point-to-point links and regular addressing patterns. The host
bus and host memory, usually the bottlenecks in a general-purpose
system, receive only low-bandwidth filtered information
from the array. Thus high-throughput vision processing
can be performed on a very inexpensive host.

3.  SLAP  Algorithms 

The  success  of  a  parallel  architecture  in  serving  a  given 
application  depends  on  how  effectively  the  computations  in 
question  can  be  mapped  onto  the  structure  of  the  machine. 
The  goal  is  to  keep  as  many  processors  as  possible  busy  doing 
useful  work,  thereby  making  cost-effective  use  of  the 
hardware  at  hand.  For  a  SLAP,  the  key  issues  bearing  on  its 
success  are  communication  and  control.  The  linear  array 
structure  has  small  global  bandwidth  (indeed,  this  is  a  major 
source of its economy of implementation), and hence algorithms
involving completely general and unpredictable data
access  will  not  perform  well.  Also,  a  SLAP’s  SIMD  control 
structure  does  not  efficiently  support  highly  heterogeneous 
computations.  In  this  section  we  review  results  we  have 
described  in  more  detail  elsewhere  (Fisher  1986,  Fisher  and 
Highnam  1987),  showing  that  a  wide  variety  of  vision  tasks, 
including  many  that  are  usually  thought  of  as  being  “global” 
in  nature,  can  be  performed  efficiently  on  a  SLAP. 

3.1.  Neighborhood  Operations 

Image-to-image transformations in which a pixel is replaced
by a function of the pixels within a k by k window are easily
implemented on a SLAP. Specific examples include convolution
and median filters, together with the usual selection of
neighborhood logic operations used in thinning and cellular
automaton simulations. The row-by-row nature of the programming
model requires that values and partial results be
cached over several rows within each PE. Table lookup
transformations can be achieved within a SLAP that has very
limited memory at each PE by spreading disjoint segments of
the table across a number of adjacent PEs or by repeated
broadcast of the complete table.
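
The row caching described above can be sketched in Python as follows. This is a sequential model under stated assumptions, not SLAP code: the names `stream_3x3` and `median9` are hypothetical, the window is fixed at 3 by 3, and border pixels are simply skipped.

```python
from collections import deque


def stream_3x3(rows, func):
    """Apply func to every 3x3 window while consuming the image row by row.

    Only the three most recent rows are cached, mirroring the way each
    PE would cache the last few values of its own column. Border pixels
    are skipped, so the output is two rows and two columns smaller.
    """
    cache = deque(maxlen=3)
    out = []
    for row in rows:
        cache.append(row)
        if len(cache) < 3:
            continue  # not enough rows cached yet
        top, mid, bot = cache
        out.append([
            func([r[x + dx] for r in (top, mid, bot) for dx in (-1, 0, 1)])
            for x in range(1, len(row) - 1)
        ])
    return out


def median9(window):
    """Median of a nine-element window."""
    return sorted(window)[4]
```

Passing `sum` instead of `median9` gives an unweighted 3 by 3 box filter with the same caching pattern.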

3.2.  Region  Operations 

Many vision tasks, particularly in industrial vision, depend
on finding, labeling, and extracting parameters of image
regions. Regions may be defined by intensity, local gradients,
range data, or texture, for example. Because there is only a
limited amount of time (in general) available for the processing
of each image row, those tasks that require substantial
lateral transmission of data are not easily handled by a single
straightforward sweep as described earlier. Instead, a variety
of techniques are possible, including: post-processing of a
limited amount of information on the host; processing the
image in vertical strips, the demarcation lines corresponding
to maximal lateral transmission distances; sweeping the image
transposed; and receiving a column-staggered version of the
image (two pixels adjacent in the same image row are
presented to the vector in successive rows, columns
unchanged). The common theme is that the vector and its
controller perform the bulk of the computation with the assistance
of a flexible memory controller, leaving high-level
processing for the host.

3.3. Projections

Many important operations can be viewed in terms of
projections along various curves or lines (Sanz and Dinstein
1987). Of particular interest, however, are linear projections;
the linear Hough transform (Duda and Hart 1973), widely
used for finding lines in images, consists in essence of projections
along sets of lines oriented at angles from zero to π. On
a SLAP, linear projections are easily computed by shifting
accumulators from neighbor to neighbor along the lines of
projection as the image passes through the array. For angles
between π/4 and 3π/4, each accumulator moves no more than
one pixel horizontally for each scan line. For shallower
angles, extra cycles can be devoted to the necessary shifting
or the transpose of the image can be processed in a second
pass. The SLAP prototype under construction can perform a
Hough transform with ninety angular values in a single real-time
pass (Fisher and Highnam 1987). Other projections,
including the generation of a two-dimensional view of a
three-dimensional scene, have been cast in terms of one-dimensional
operations (Robertson 1987) and can be efficiently
performed on a SLAP.
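
The accumulator-shifting scheme can be illustrated with a short Python sketch. It is a simplified model, not the prototype's code: the function name `project` is hypothetical, and only three directions are covered (straight down and the two diagonals), whereas an intermediate angle would presumably shift on only some of the scan lines.

```python
def project(rows, shift):
    """Sum pixels along lines that move `shift` columns (-1, 0 or +1) per row.

    One accumulator sits in each PE. After every scan line the
    accumulators move one PE in the direction of `shift`. Sums that
    fall off the end of the array are collected in `finished`, standing
    in for the controller's end register, and zeros enter at the other
    end. Returns (finished, accumulators still in the array).
    """
    width = len(rows[0])
    acc = [0] * width
    finished = []
    for row in rows:
        acc = [a + p for a, p in zip(acc, row)]  # every PE adds its pixel
        if shift == 1:
            finished.append(acc[-1])
            acc = [0] + acc[:-1]
        elif shift == -1:
            finished.append(acc[0])
            acc = acc[1:] + [0]
    return finished, acc
```

With `shift=0` the result is simply the column sums; with `shift=1` or `shift=-1` each completed diagonal sum leaves the array as soon as its line exits the image.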

3.4.  Image  Transforms 

A  number  of  image  processing  applications  call  for  2D 
image  transforms,  such  as  the  Fourier,  Walsh-Hadamard  and 
Haar. These are all separable, facilitating their efficient implementation
on a SLAP endowed with enough memory per
PE to accomplish a one column fast transform internally.
Such a SLAP can perform an arithmetically simple transform
within two frame times, with the time taken dominated by the
time to transpose intermediate results. The time taken by
transforms  requiring  a  great  deal  of  arithmetic  processing  per 
output  point  is  dominated  by  arithmetic,  leading  to  very  high 
processor  utilization. 
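
Separability means the 2D transform reduces to column transforms, a transpose, and column transforms again. A minimal Python sketch for the Walsh-Hadamard case follows; the names `fwht`, `transpose` and `wht2d` are hypothetical, the transform is unnormalized and in natural (Hadamard) order, and the image sides are assumed to be powers of two.

```python
def fwht(vec):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length list."""
    v = list(vec)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return v


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def wht2d(image):
    """2D transform as two passes of column transforms around a transpose."""
    # Pass 1: each PE transforms the image column it holds.
    cols = [fwht(c) for c in transpose(image)]
    # Transpose the intermediate results so each PE now holds a row of them.
    rows = transpose(cols)
    # Pass 2: the same one-column transform is applied again.
    return [fwht(r) for r in rows]
```

In this model the only non-local step is the transpose, which is consistent with the remark above that transposition dominates the time for arithmetically simple transforms.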

3.5.  Miscellaneous  Algorithms 

• Skeletonization uses a local neighborhood operation
iteratively to reduce “black” areas of an
image to their centerlines. More efficient on a
SLAP is a grassfire (Montanari 1968) algorithm,
using two passes over the image to compute the
minimal distance from each black pixel to a white
one, then a thresholding pass to preserve local
maxima. A SLAP implementation would use
column-staggering or transposition.

• Histogramming is a straightforward operation on
a SLAP. The accumulation is shared between
PEs during the image pass; a post-pass phase
accumulates histogram bins and shifts the histogram
out of the vector.

• Image resampling on a SLAP is straightforward
provided selective line fetch (under the command
of the controller) or large column memories are
available.

• Image warping has an efficient SLAP implementation
when independently addressable column
memories are available. Vertical displacements
are handled by addressing, while horizontal displacements
are handled by shifting.

• Operations that involve multiple images in pixel-to-pixel
operations can be achieved by interleaving
image rows (for example) as they are
delivered.
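
The grassfire approach mentioned in the first item above can be sketched sequentially in Python. This is an illustration under assumptions rather than a SLAP implementation: the function names are hypothetical, the city-block metric is assumed, pixels outside the image are treated as white when testing for maxima, and the column-staggering or transposition needed on the array is not modeled.

```python
def grassfire(image):
    """Two-pass city-block distance from each black (1) pixel to the nearest white (0)."""
    h, w = len(image), len(image[0])
    big = h + w  # larger than any possible distance inside the image
    d = [[0 if image[y][x] == 0 else big for x in range(w)] for y in range(h)]
    for y in range(h):                      # forward pass: look up and left
        for x in range(w):
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 1)
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 1)
    for y in reversed(range(h)):            # backward pass: look down and right
        for x in reversed(range(w)):
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 1)
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 1)
    return d


def local_maxima(d):
    """Mark pixels whose distance is positive and not exceeded by any 4-neighbor."""
    h, w = len(d), len(d[0])

    def at(y, x):
        return d[y][x] if 0 <= y < h and 0 <= x < w else 0

    neighbors = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return [[int(d[y][x] > 0 and all(d[y][x] >= at(y + dy, x + dx)
                                     for dy, dx in neighbors))
             for x in range(w)] for y in range(h)]
```

For a 3 by 3 black square the marked pixels are the four corners and the center, that is, the two diagonals of the square.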

3.6.  General  Comments 

In  addition  to  these  general  algorithm  classes,  many  other 
problems  seem  to  be  well-suited  to  SLAP  computation.  In 
general,  a  SLAP  is  more  flexible  than  strictly  local  image 
processors  in  that  state  related  to  contiguous  features  of  an 
image  can  be  maintained  and  passed  among  nearby  processors 
as  necessary.  Global  transfer  of  information  can  be  achieved 
via  the  fast  shift  register  and  the  broadcast  SIMD  control, 
allowing flexibility far beyond that of most window processors
or nearest-neighbor structures. The addition of sufficient
memory  to  hold  an  image  column  at  each  processor  opens  the 
door  to  a  class  of  algorithms  that  neither  window  processors 
nor  two-dimensional  arrays  can  use  effectively.  Some  general 
approaches to algorithm design for a linear array are discussed
in an earlier paper (Fisher 1986).

4.  SLAP  Programming  and  Compilation 

This section discusses some of the issues involved in programming
a SLAP. We begin with a brief review of SIMD
programming  languages.  Following  a  description  of  the  loci 
of  program  control  in  a  complete  system,  we  discuss  the 
SLAP  programming  environment  that  we  are  constructing. 
An  interesting  feature  is  a  formulation  of  communication  in 
fine  grain  systems  that  facilitates  a  high  degree  of  automatic 
code  optimization. 

4.1.  SIMD  Programming 

In terms of hardware investment, the SIMD paradigm
provides a simple way to deliver a large number of processing
cycles. However, realizing that potential without laborious
hand-coding is problematic. General purpose machines support
the common high level languages reasonably efficiently.
They  can  do  this  because  the  abstract  machines  that  underlie 
those  languages  are  not  too  dissimilar,  and  those  same 
abstractions  are  mapped  easily  to  conventional  processors. 
Another  way  of  saying  this  is  that  the  programming  models 
presented  by  each  high  level  language  are  easily  mapped  to 
conventional  processors. 

The programmers of SIMD machines do not have high
level, multi-machine languages to choose from. This is because
the machines constructed to date have been sufficiently
diverse in both structure and intended application that an
efficient mapping from even one fixed language to several
machines has not been seen.

Generally, SIMD systems have followed the same route in
language definition by augmenting existing high-level languages
with constructs that make it quite clear to the compiler
which code should be executed in the PEs and which in the
controller/host. These include the various Fortran and Algol
implementations on the ILLIAC IV (Hord 1982), the Pascal
implementation on the MPP (Reeves 1984), Connection
Machine Lisp and C (Christman 1983, Steele and Hillis
1986), C on the PIXIE-5000 (Wilson 1985), Fortran on the
DAP (Flanders et al 1977), C on CLIP4 (Duff and Fountain
1986), and C on PASM (Kuehn and Siegel 1985). Unfortunately,
as revealed by user surveys (Perrott and Stevenson
1981, Wetherell 1980) and user reports (Marks 1980,
Oldfield and Reddaway 1985, Smith 1984), a user must still
be aware of the dimensions of the machine and the data.

4.2.  SLAP  Code  Generation 

Programming a SLAP involves generating code for the
PEs, the controller, and the host. The synchronous nature
of a SLAP’s communications facilitates the compiler generation
of interdependent threads of computation in the vector
and its controller from a single program. The host’s program
interacts with the vector and its controller only occasionally
and asynchronously with the pixel data stream, so a subroutine
or coroutine protocol is appropriate.

The  SLAP  component  that  receives  host  commands  is  the 
controller.  The  controller  contains  the  code  for  the  PEs  and 
itself  in  separate  memory  units.  When  an  image  operation  is 
invoked,  the  controller  begins  executing  and  issuing  code 
from a certain address. The controller has branching and
computational abilities of its own, which are employed concurrently
with PE instruction execution to execute the usual
constructs of a sequential language: conditional branching,
iterative loops, and so on. The controller interacts with the
vector using data as well as control, via the two “end”
registers. Global vector status can be sampled using the
some/none response line. Controller and PE cycles are
synchronized, making the interchange of data easy. The controller
can also transfer values that it has computed to all PEs
by simply incorporating the value in the instruction stream as
a literal. Note that there is no logical constraint preventing
the controller and vector from obeying different code for each
input line.

The  PEs  are  pipelined,  but  are  not  so  complex  that  the  PE 
instruction  stream  width  is  too  small  to  yield  precise  control 
of  the  activities  of  PE  components.  This,  with  the  symmetry 
of  the  functional  units,  makes  PE  code  generation  relatively 
straightforward. The impact of a small internal memory, local
addressing, and the conditional capabilities has yet to be
fully  explored.  The  word  size  is  expected  to  suffice  for  most 
operands,  but  multi-word  operands  are  supported. 

4.3.  SLAP  Programming 

Following the model of previous SIMD systems, we are
implementing a language that is an extension of sequential
languages, in our case Pascal and C. The division of
labor between the array and the controller is implicitly
specified by the use of variables declared to be instantiated in
the controller or the PEs. The compiler will generate code for
the entire system. Within this framework two programming
models are supported. The first is a “native” mode for the
SLAP, in which code is specified on a scanline by scanline
basis. The alternative is to write the program in a
position-independent style. A large proportion of image
computations can be expressed as implicitly parallel
position-independent operations, accessing values at other
pixel positions using relative
offsets (Harney et al 1987). For obvious reasons, this model is
a good match with a SIMD machine, especially one where
PEs are directly correlated with image coordinates. On a
SLAP, a program written in the position-independent style is
compiled into the code executed by the vector and the controller
during each row.

To facilitate the coding of position-independent programs
and establish a basis for efficient code optimization, we introduce
a new type of unary operator to a conventional expression
language, the directional (Fisher and Highnam 1988). In
image processing we use four directionals: LEFT, RIGHT,
UP and DOWN. The application of a directional such as
LEFT to an expression e means that the computation at the
current position uses the value of e as computed at the pixel
position to the immediate left, and similarly for the other three
directionals. Thus, for example, LEFT UP e yields the value
of e as computed at the upper left neighbor. These semantics
can be formalized by defining the usual arithmetic operators
on grids of values, and defining directionals as coordinate
translations. The formulation of an algebra of communication
and computation is then straightforward.
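
The reading of directionals as coordinate translations can be made concrete with a small Python sketch. It is an illustration only, not the language being implemented: a grid-valued expression is modeled as a function of (x, y), the helper `pixel` and its zero `border` value are hypothetical, and y is assumed to increase downward so that UP reads the previous row.

```python
def LEFT(e):
    """Value of e as computed at the pixel to the immediate left."""
    return lambda x, y: e(x - 1, y)


def RIGHT(e):
    """Value of e as computed at the pixel to the immediate right."""
    return lambda x, y: e(x + 1, y)


def UP(e):
    """Value of e as computed at the pixel immediately above."""
    return lambda x, y: e(x, y - 1)


def DOWN(e):
    """Value of e as computed at the pixel immediately below."""
    return lambda x, y: e(x, y + 1)


def pixel(image, border=0):
    """Lift an image (list of rows) to a grid expression.

    Reads outside the image return `border` (an assumed convention).
    """
    h, w = len(image), len(image[0])
    return lambda x, y: image[y][x] if 0 <= x < w and 0 <= y < h else border
```

Because each directional is a pure translation, they commute with one another and opposite directionals cancel, which is the kind of algebraic identity an optimizer can exploit.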

A drawback to a simple-minded implementation of
position-independent code is the inability to reuse intermediate
results from other pixels. For example, convolution
with a symmetric kernel or sorting within a neighborhood
involves computing values that can be useful at a number of
distinct pixels. The directional formulation allows an expression
graph optimizer to detect redundant computation and,
using the algebra of operators, reorganize the graph to
produce efficient code.

The  optimizing  component  of  a  compiler  for  a  grid  or  a 
SLAP  architecture  has  been  constructed  using  OPS5  (Forgy 
1981,  Brownston  et  al  1985)  to  manipulate  data-dependency 
graphs.  The  input  graph  is  derived  in  the  usual  way  from  an 
expression  that  includes  all  the  usual  arithmetic  operators  and 
directionals.  The  graph  is  transformed  to  minimize  the  total 
number  of  operations  contained  (including  directionals).  The 
compiler  algorithm  is  greedy  (in  the  technical  sense)  and 
makes  heavy  use  of  associative  properties  of  operators. 
Results  have  been  extremely  promising.  For  example,  the 
code  for  a  symmetric  convolution: 

    global a, b, c;

    result := a * LEFT UP p +
              b * UP p +
              a * RIGHT UP p +
              b * LEFT p +
              c * p +
              b * RIGHT p +
              a * LEFT DOWN p +
              b * DOWN p +
              a * RIGHT DOWN p;

is  transformed  to: 


    temp1  := a * p;
    temp2  := b * p;
    temp3  := temp2 + UP temp1 + DOWN temp1;
    result := RIGHT temp3 + LEFT temp3 +
              DOWN temp2 + UP temp2 + p * c;

The optimizer reduces a program containing 12 communications,
8 additions and 9 multiplications into one using
6 shifts, 6 additions and 3 multiplications. On a SLAP doing
single-precision accumulation and using 8 bit pixels, this
amounts to reducing about 44 instructions to about 18. The
savings grow as precision increases, since multiplication is a
multi-cycle operation.
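
The equivalence of the two forms can be checked with a short Python sketch. It assumes that temp1 and temp2 hold a * p and b * p, which is the reading consistent with the counts of 3 multiplications, 6 additions and 6 shifts; the helper names and the zero border are hypothetical. The lambdas model only the semantics, not the cost, since they recompute shared values rather than storing them.

```python
def shift(e, dx, dy):
    """Grid expression e translated so that it reads from (x + dx, y + dy)."""
    return lambda x, y: e(x + dx, y + dy)


def LEFT(e):
    return shift(e, -1, 0)


def RIGHT(e):
    return shift(e, 1, 0)


def UP(e):
    return shift(e, 0, -1)


def DOWN(e):
    return shift(e, 0, 1)


def pixel(image):
    """Lift an image to a grid expression with an assumed zero border."""
    h, w = len(image), len(image[0])
    return lambda x, y: image[y][x] if 0 <= x < w and 0 <= y < h else 0


def original(p, a, b, c):
    """Nine-term symmetric convolution: 12 shifts, 8 additions, 9 multiplications."""
    return lambda x, y: (
        a * LEFT(UP(p))(x, y) + b * UP(p)(x, y) + a * RIGHT(UP(p))(x, y)
        + b * LEFT(p)(x, y) + c * p(x, y) + b * RIGHT(p)(x, y)
        + a * LEFT(DOWN(p))(x, y) + b * DOWN(p)(x, y) + a * RIGHT(DOWN(p))(x, y))


def optimized(p, a, b, c):
    """Factored form: 6 shifts, 6 additions, 3 multiplications."""
    temp1 = lambda x, y: a * p(x, y)
    temp2 = lambda x, y: b * p(x, y)
    temp3 = lambda x, y: temp2(x, y) + UP(temp1)(x, y) + DOWN(temp1)(x, y)
    return lambda x, y: (RIGHT(temp3)(x, y) + LEFT(temp3)(x, y)
                         + DOWN(temp2)(x, y) + UP(temp2)(x, y) + p(x, y) * c)
```

Evaluating both forms at every pixel of a small test image gives identical results, including at the borders.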

The Canny edge operator has the property that the convolutions
used in its computation are separable. Coding this
operator, with a minimal window size of nine pixels, and then
running the result through the optimizer produced very good
results: multiplications reduced from 26 to 8; additive operations
down from 17 to 9; communication steps reduced from
24 to 8. The optimizer produces even better results on larger
windows.

Code  using  directionals  is  appropriate  for  direct  execution 
on  a  grid.  For  a  SLAP,  the  UP  and  DOWN  directionals  can 
be  interpreted  as  temporal  rather  than  spatial  operators  to 
yield a space-time processing schedule. Different operator
interpretations can yield transposed, skewed and other execution
schedules (Fisher and Highnam 1988).

5.  Implementation 

We  are  in  the  midst  of  constructing  a  prototype  SLAP 
hardware  and  software  system.  The  centerpiece  of  this  effort 
is  a  custom  CMOS  chip  including  four  processing  elements 
and an instruction decoder. An array of these chips together
with off-the-shelf memory will be driven by a global controller
and interfaced to commercial video components and a
host workstation. Finally, we are constructing a suite of
program development tools including an optimizing high-level
language compiler.

5.1.  Array  Chip 

Each  processing  element,  as  shown  in  Figure  5-1,  includes 
an  eight  bit  video  shift  register  stage  and  a  sixteen  bit  datapath 
including a register file, an ALU, a rotate/shift unit and communication
registers. Each PE also has a local control
facility. 

• The register file has 32 entries and is dual-ported,
able to read and write a word in one cycle. The
address for an access is presented either globally
or from a local address register.

• The ALU performs a typical set of operations. A
two-bit multiplication step and a one-bit divide
step are provided, supported by a double-width
accumulator set. Multiplication and division sequencing
are supported in hardware, freeing the
instruction stream for operations not involving
the ALU. Multi-precision versions of the arithmetic
operations are supported.

• The rotate/shift unit performs logical and arithmetic
shift and rotate operations, with shift distances
determined either locally or globally. A
shift or rotate of any distance is completed in a
single cycle.

• Adjacent PEs communicate via a communication
register that can be transferred to (and loaded
from) a neighbor in a single cycle. A PE can
send a signal to the controller by writing a bit to
the response line register, which is then ORed
with the response line register of every other PE.

• Conditional execution is achieved by causing
processors to “sleep” or “wake” according to
datapath condition codes. Sleeping processors
are unable to change certain elements of their
state. Support is provided for nesting of conditionals.
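
The sleep/wake style of conditional execution in the last item can be modeled with a brief Python sketch. The text does not say how nesting is supported, so the per-PE depth counter used here is an assumption, as are the names `PE`, `begin_if`, `end_if` and `execute`.

```python
class PE:
    """Toy processing element with one value and a sleep-depth counter."""

    def __init__(self, value):
        self.value = value
        self.sleep_depth = 0  # 0 means awake (assumed nesting mechanism)

    @property
    def awake(self):
        return self.sleep_depth == 0


def begin_if(pes, cond):
    """Open a conditional: PEs failing cond, or already asleep, go one level deeper."""
    for pe in pes:
        if pe.sleep_depth > 0 or not cond(pe):
            pe.sleep_depth += 1


def end_if(pes):
    """Close the innermost conditional: sleeping PEs come up one level."""
    for pe in pes:
        if pe.sleep_depth > 0:
            pe.sleep_depth -= 1


def execute(pes, op):
    """Broadcast one instruction; only awake PEs change their state."""
    for pe in pes:
        if pe.awake:
            pe.value = op(pe.value)
```

Every PE receives the same instruction stream; the counter only decides whether the instruction is allowed to change local state, which keeps the control strictly SIMD.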


Figure 5-1: Processing Element data path

The  array  chip  will  comprise  some  45,000  transistors,  about 
40,000 in the PEs and 5,000 in the decoder and peripheral
circuits.  In  a  2-micron  process,  the  chip  will  occupy  a  die 
area  of  about  63  square  millimeters. 

We  expect  the  chip  to  achieve  an  instruction  issue  rate  of  8 
MHz,  and  a  video  shift  register  clock  rate  of  16  MHz.  For  a 
512  by  512  frame  scanned  at  30  Hz,  real-time  processing  on 
an  array  of  512  PEs  allows  some  500  instructions  to  be 
executed  for  each  pixel.  The  512-PE  array  has  an  aggregate 
computation throughput of 4 GOPS. For example, this allows
a 9 by 9 convolution, an 11 by 11 pseudo-median filter, or a
21 by 21 edge finding operator to run in a single frame time
of about 33 msec.
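
The throughput figures above follow from simple arithmetic, reproduced here as a small Python sketch. The function names are hypothetical, and the calculation assumes one PE per image column and ignores video blanking intervals.

```python
def instructions_per_pixel(instr_rate_hz, rows, frame_rate_hz):
    """Instructions available per pixel when each PE owns one image column.

    A PE sees `rows` pixels per frame, hence rows * frame_rate_hz
    pixels per second.
    """
    return instr_rate_hz / (rows * frame_rate_hz)


def aggregate_ops(instr_rate_hz, num_pes):
    """Peak operations per second across the whole array."""
    return instr_rate_hz * num_pes


def frame_time_ms(frame_rate_hz):
    """Duration of one frame in milliseconds."""
    return 1000.0 / frame_rate_hz


if __name__ == "__main__":
    print(instructions_per_pixel(8e6, 512, 30))  # roughly 520 instructions per pixel
    print(aggregate_ops(8e6, 512) / 1e9)         # about 4.1 GOPS
    print(frame_time_ms(30))                     # about 33 msec
```

At 8 MHz and 512 rows per frame this gives a little over 520 instructions per pixel, in line with the "some 500" quoted, and 512 PEs at 8 MHz give 4096 MOPS.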


5.2.  System  Configuration 

SLAP  processor  chips  will  be  mounted  on  printed  circuit 
boards  along  with  commercial  memory  chips  for  column 
memories.  Column  memories  are  accessed  using  the  video 
shift  register  for  data  and  the  neighbor  communication  path 
for  addresses.  Using  one  32K  by  8  static  RAM  chip  for  each 
array  chip,  we  can  supply  each  PE  with  4K  bytes,  and  allow 
one  memory  access  for  every  four  instructions.  A  separate 
board  will  hold  the  global  controller  and  host  interface.  We 
expect  to  fit  64  array  chips  and  their  external  memory  onto  a 
single  standard-size  Sun-3  VME  card;  thus  a  512-processor 
system  will  require  two  array  cards  and  a  controller  card.  The 
system  will  also  include  commercial  image  capture,  storage 
and  display  components. 

5.3.  Software 

We  have  a  simulator  for  a  SLAP  system  that  operates  at  the 
level  of  microcode  with  a  time  resolution  of  a  half  cycle.  This 
software  incorporates  an  assembler  for  the  PE  code.  More 
work  is  needed  on  this  program  to  provide  increased  logical 
error detection abilities, to permit the user to select the position
on the simulation-emulation spectrum to use, to provide
statistical  information,  and  to  refine  the  model  of  the  vector 
controller  and  communication  within  and  without  the  system. 

Work  is  proceeding  on  the  implementation  of  a  higher  level 
compiler that will generate code for both the vector controller
and  the  processing  elements.  There  will  be  two  versions.  The 
first  will  provide  a  more  pleasant  language  than  assembler  for 
expressions  and  a  few  simple  control  constructs.  The  second 
compiler  will  incorporate  the  optimization  ideas  described 
above  to  generate  efficient  code  without  excessive  contortions 
on  the  part  of  the  programmer. 

6.  Performance 

This  section  gives  some  performance  estimates  for  the 
SLAP  prototype  under  construction,  and  compares  them  to 
measured performance numbers for a VAX-11/780 and the
CMU  Warp  machine  (Annaratone  et  al  1987),  a  10-processor 
floating  point  systolic  array.  VAX  numbers  were  measured 
on  tuned  integer  code,  while  Warp  numbers  were  taken  from 
the  paper  cited.  The  algorithms  considered,  all  on  512x512 
arrays,  are: 

• binary filtering: 3x3 neighborhood logical filtering
of a binary image.

• histogram: a 256 level histogram.

• 3x3 convolve: 8 bit pixels and 8 bit weights. The
SLAP code accumulates results to 16 bits precision.

• 3x3 median.

• Walsh transform: an FFT-like transform with
weights of +/-1 on 8 bit pixels.

• Floyd-Warshall: all-points nearest neighbor in a
graph. The SLAP implementation uses 32 bit
integers. The Warp performance number has
been scaled up from the 350 node figure given in
the Warp paper.

Two  caveats  must  be  attached  to  these  figures.  One  is  that 
the  Warp  and  VAX  numbers  represent  measured  performance 
on existing machines, while the SLAP numbers are best estimates
based on handwritten code. The second is that the
algorithms chosen are typically integer programs; thus Warp's
floating point prowess is hidden under a bushel. Nonetheless,
the numbers shown give a feel for SLAP’s performance on
some  common  image  processing  operations  and  place  it  in 
context. 


Figure  6-1  shows  elapsed  times  for  each  of  the  machines  on 
each  of  the  algorithms.  Where  the  time  indicated  on  SLAP  is 
less  than  a  frame  time,  it  indicates  the  time  needed  to  perform 
that  operation  either  on  a  stored  image  or  on  an  image  already 
undergoing  other  processing.  Note  that  the  horizontal  axis  is 
logarithmic. 

Figure 6-1: Elapsed times


Figure  6-2  places  these  numbers  in  context  by  normalizing 
them  to  a  measure  of  raw  power.  Speedups  relative  to  VAX 
range  from  545  to  5104,  with  a  geometric  mean  of  2166. 
Speedups  relative  to  Warp  range  from  20  to  187.5  (though  the 
second  highest  speedup  is  50),  with  a  geometric  mean  of  47.1 
(35.7  with  the  high  figure  omitted).  The  vertical  bar  marked 
“OPS  ratio”  shows  the  speedup  expected  solely  on  the  basis 
of  peak  operations  performed  per  second.  The  scale  factors 
shown  rate  SLAP  at  4096  MOPS,  Warp  at  100  MOPS,  and 
VAX  at  1  MOPS.  Where  the  speedup  bars  cross  the  line, 
SLAP performs even better than indicated by these ratios; for
example, the SLAP datapath happens to be particularly efficient
on the inner loops used for median filtering and binary
filtering. Where the speedup bars lie to the left of the line,
SLAP does not achieve its theoretical speedup. For histogramming,
SLAP is able to achieve only a factor of 10 or so
in useful concurrency, and hence does not fare as well as
usual against a VAX. The lack of a parallel multiplier explains
the subpar speedup for convolution. Finally, the Floyd-Warshall
speedup shows the penalty incurred on SLAP by
using 32 bit numbers, which on a large dataset require 4
accesses apiece to external memory.

SLAP speedup relative to Warp
Figure 6-2: Scaled speedups

7.  Summary 

We  have  proposed  a  new  architecture  for  low-level  vision 
processing  that  is  highly  suitable  for  VLSI  implementation, 
and  have  shown  how  it  can  be  applied  to  a  large  set  of 
problems.  The  SLAP  architecture  can  be  deployed  in  a 
variety of system environments, from dedicated real-time image
enhancement boxes to a programmable vision subsystem
for  general-purpose  computers.  It  can  also  be  scaled  over  a 
wide  range  of  cost  and  performance,  according  to  user  needs. 
As  a  result,  SLAPs  appear  to  offer  high  throughput  in  a  large 
number  of  applications  at  modest  cost. 

Prototype  hardware  is  under  construction.  Simultaneously, 
we  are  developing  high  and  low  level  programming  tools. 
Experimentation  with  realistic,  complete  image  applications 
will provide a more complete evaluation of the SLAP concept.

Pyramid Algorithms Implementation on the Connection Machine

Abstract

Pyramid  architectures  have  lately  been  proposed  in  the 
literature  to  implement  a  large  number  of  image  understanding 
algorithms,  such  as  multi-resolution,  pipelined,  and  bottom/up 
algorithms. In these architectures each processing element
(PE) on an intermediate layer is connected to nine other PE’s:
four on the same level, four in the layer below it, and a single
parent in the layer above it. Hardware implementation of such
pyramid  machines  is  expensive  because  of  the  extensive 
wiring  involved.  The  Connection  Machine  (CM),  a  highly 
parallel  fine-grained  SIMD  machine  with  a  hypercube  and 
mesh  interconnection  networks,  has  been  available  for  some 
time  now  to  the  vision  community. 

In this paper, we describe the implementation of an address
mapping scheme that efficiently maps the pyramid
architecture onto the CM, resulting in an efficient way of
implementing pyramid algorithms on the CM. In this mapping,
each PE in the CM simulates a PE on the base of the pyramid
and at most one PE on the intermediate levels. Top/bottom
communication in the pyramid can be simulated in this scheme
using only three hypercube communication cycles in the CM.
Mesh communication in all intermediate levels of the pyramid
architecture, on the other hand, can be simulated using a
number of hypercube communication cycles that is at most
proportional to the dimension of the cube (16 in the case of a
64K CM). A programming environment for pyramid
algorithms, using this addressing scheme, has been
implemented on the CM. It allows the user to create pyramid
data structures, load/unload images from various pyramid
levels, move data up/down, and perform several operations
such as convolution and hierarchical operators on the created
data structures.

1  Introduction 

Image analysis tasks require a large amount of computation to
process the large quantities of data contained in images with
reasonable resolution. To meet this requirement, a number of
parallel architectures [2], [3], [5], [6], [7] have been
proposed for application to image analysis problems.

Recently, pyramid architectures have been proposed to
implement image analysis tasks efficiently (in real time),
especially multi-resolution and top-down/bottom-up image
analysis tasks ([11], [3], [4], and [8]). Examples of
algorithms that have been proposed for implementation on
pyramid machines can be found in [1], [10], and [12]. The
pyramid architecture consists of a set of mesh-connected
layers of processing elements (PE's) successively decreasing
in size by a factor of four.

Each PE on an intermediate layer is connected to four children
on the layer below it, to a single parent in the layer above it,
and to four (or nine) neighbors in the same layer, as shown in
Figure 1. Hardware implementation of pyramid architectures
with a large number of PE's is expensive because of the large
amount of wiring that is required in such machines.

The Connection Machine (CM), a highly parallel, fine-grained,
single instruction stream, multiple data stream (SIMD) machine
with mesh and hypercube interconnection networks, has
recently been developed at Thinking Machines Inc., and a 64K
version has been available commercially.

In this paper, we describe an addressing scheme that
efficiently maps the pyramid architecture onto the Connection
Machine (CM). By "efficiently" we mean that pyramid
communication can be simulated on the CM using a very
small number of short, fast CM cycles. This results in an
efficient implementation of pyramid algorithms on the CM. In
this mapping, each PE in the CM simulates a PE on the base
of the pyramid and at most one PE on the intermediate levels.
Top/bottom communication in the pyramid can be simulated
in this scheme using only three hypercube communication
cycles in the CM. Mesh communication in all intermediate
levels of the pyramid architecture, on the other hand, can be
simulated using a number of hypercube communication cycles
that is at most proportional to the dimension of the cube (16 in
the case of a 64K CM). A programming environment for
pyramid algorithms, using this addressing scheme, has been
implemented on the CM. It allows the user to create pyramid
data structures, load/unload images from various pyramid
levels, display images on various levels, move data up/down,
and perform several operations such as convolution and
hierarchical operators on the created data structures.

In  the  following  sections,  we  describe  the  addressing  scheme, 
and  show  how  various  pyramid  communication  modes  can  be 
simulated  on  the  CM.  A  brief  description  of  the  programming 
environment  is  then  presented. 

2  Mapping  Scheme 

The Connection Machine uses two modes of communication.
The first is mesh communication, in which the whole array of
PE's (64K) forms a two-dimensional orthogonal mesh (256 x
256). In this mode, each PE may communicate with its four
neighbors in the east/west/north/south directions (NEWS
network), as shown in Figure 2. This mode of communication
is very fast, as it uses direct links between the neighboring
PE's.

The second mode is hypercube communication, in which each
PE can communicate with all the PE's whose binary addresses
differ from its own address in exactly one bit location. In the
64K machine, this means each PE is also connected to 16 PE's,
using the binary 16-dimension hypercube interconnection
network. For example, each PE communicates in the direction
of the first dimension with the PE whose address is computed
from its own address by flipping the least significant bit
(1 --> 0, 0 --> 1). In general, each PE communicates along the
n-th dimension with the PE whose address differs from its own
only in the n-th least significant bit. For example, PE0 is
connected to PE1 along the first dimension, to PE2 along the
second dimension, to PE4 along the third dimension, ..., and to
PE(2**15) along the 16th dimension. Thus the maximum
number of links that a message may require is equal to the
number of binary digits in the PE's addresses (the number of
dimensions in the hypercube). In the first version of the CM,
when hypercube communication is invoked, the time allocated
for the communication is the time needed for the message to
travel from one node to its neighbor PE in the hypercube
multiplied by the number of cube dimensions. This time must
also take into account the buffering of messages that arrive at a
specific node in the network at the same time. The second
version of the CM has a shorter hypercube cycle (petit cycle)
that can be used when it is known that messages are not going
to collide when traveling in the network. In our scheme, we
define a new cycle time for hypercube communication, which
we will refer to as the "hypercube cycle". This is defined as the
time required for communication between two neighboring
nodes in the hypercube network. This cycle should be
extremely fast (one nth of the petit cycle, where n is the number
of hypercube dimensions).
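
The hypercube addressing rule described above can be illustrated with a short sketch. Python is used here purely for illustration (it is not a CM language), the function names are hypothetical, and dimensions are numbered from 1 as in the text.

```python
def hypercube_neighbor(address: int, dimension: int) -> int:
    """Return the address reached from `address` along a 1-based dimension.

    Travelling along dimension d flips the d-th least significant bit.
    """
    if dimension < 1:
        raise ValueError("dimensions are numbered from 1")
    return address ^ (1 << (dimension - 1))


def links_needed(source: int, destination: int) -> int:
    """Number of hypercube links between two PEs: the bits in which they differ."""
    return bin(source ^ destination).count("1")


# Neighbors of PE0 along dimensions 1, 2, 3 and 16 of a 16-dimension cube.
print([hypercube_neighbor(0, d) for d in (1, 2, 3, 16)])
```

The worst case for `links_needed` is a pair of addresses that differ in every bit, which gives the bound of one link per cube dimension stated in the text.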

To map the pyramid architecture onto the Connection Machine,
the CM mesh network of PE's simulates the base of the
pyramid, while the PE's on the internal levels of the pyramid
are simulated according to the following scheme.
The lower right PE of each 2 x 2 cube simulates the parent
of the four PE's in this 2-D cube. Similarly, for the second
level above the leaf level in the pyramid, the PE connected to
the PE in the lower right corner along the first dimension of
the hypercube is used to emulate the parent of the PE's in
the 4 x 4 cube (4-D hypercube). A general formula that
specifies the CM addresses of the PE's on the intermediate
levels of the pyramid is as follows:

$4^{i} n - 2^{i-1}$, where $i$ is the level number and
$n$ ranges from 1 to the number of PE's on that
level ($2^{2(8-i)}$).

Figure  3  shows  this  mapping  for  an  8  x  8  mesh.  The 
hypercube  connections  are  not  shown  in  this  configuration,  but 
they  exist  according  to  the  scheme  described  before. 
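
As a rough illustration of the address formula, the following Python sketch lists the CM addresses used by each intermediate level. The function names and the `cube_dims` parameter are assumptions introduced here; the count of PE's per level generalizes the 64K case (16 dimensions) to any even number of dimensions.

```python
def intermediate_address(level: int, n: int) -> int:
    """CM address of the n-th PE (n >= 1) on intermediate level `level` (>= 1)."""
    if level < 1 or n < 1:
        raise ValueError("level and n are both numbered from 1")
    return 4 ** level * n - 2 ** (level - 1)


def pes_on_level(level: int, cube_dims: int = 16) -> int:
    """Number of PE's on a level; equals 2**(2*(8 - level)) for 16 dimensions."""
    return 2 ** (cube_dims - 2 * level)


def level_addresses(level: int, cube_dims: int = 16) -> list:
    """All CM addresses that simulate PE's of the given intermediate level."""
    count = pes_on_level(level, cube_dims)
    return [intermediate_address(level, n) for n in range(1, count + 1)]


# An 8 x 8 mesh corresponds to a 6-dimension cube.
for level in (1, 2, 3):
    print(level, level_addresses(level, cube_dims=6))
```

For the 8 x 8 mesh, level 1 uses every fourth address starting at 3, level 2 uses 14, 30, 46 and 62, and the apex is simulated by PE60.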

The mapping scheme described above ensures that each PE in
the pyramid intermediate levels is emulated exclusively by
one PE in the CM (one-to-one mapping). The proof is as
follows. The address mapping formula is $4^{i} n - 2^{i-1}$.
On the same intermediate level, $i$ is constant, so changing
$n$ results in a different address each time. To prove that no
two PE's on different intermediate levels are mapped to the
same CM PE address, assume the contrary: for PE $n_1$ on
level $i_1$, there exists a PE $n_2$ on level $i_2$, with
$i_1 > i_2$ without loss of generality, such that

$4^{i_1} n_1 - 2^{i_1-1} = 4^{i_2} n_2 - 2^{i_2-1}$.

The equation can be rearranged as follows:

$4^{i_2}(4^{i_1-i_2} n_1 - n_2) = 2^{i_2}(2^{i_1-i_2-1} - 2^{-1})$,

$2^{i_2}(4^{i_1-i_2} n_1 - n_2) = 2^{i_1-i_2-1} - 0.5$.

The left-hand side of the equation is always an integer, as
$i_1$, $i_2$, $n_1$, and $n_2$ are all integers, while the
right-hand side is not. Thus the initial assumption is false.
This proves that the mapping is one-to-one.
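
The one-to-one property can also be checked by brute force. The sketch below (Python, with hypothetical names) enumerates every intermediate-level address for a cube with an even number of dimensions and fails if any address is produced twice. It is a numerical check under those assumptions, not a substitute for the proof.

```python
def intermediate_addresses(cube_dims: int) -> dict:
    """Map each intermediate-level CM address to its (level, n) pair.

    Raises AssertionError if two pyramid PE's share one CM address.
    """
    seen = {}
    for level in range(1, cube_dims // 2 + 1):
        for n in range(1, 2 ** (cube_dims - 2 * level) + 1):
            address = 4 ** level * n - 2 ** (level - 1)
            if address in seen:
                raise AssertionError(f"collision at address {address}")
            seen[address] = (level, n)
    return seen


table = intermediate_addresses(16)
# Every intermediate PE of the 64K pyramid has its own CM address.
print(len(table), min(table), max(table))
```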


Now  there  are  two  types  of  communication  in  the  pyramid  that 
need  to  be  simulated,  the  top/down  communication 
(parent/child  communication),  and  the  lateral  communication 
in  the  intermediate  levels  of  the  pyramid. 

3  Simulating  Pyramid  Intermediate  Levels  Mesh 
Communication 

In this section, we describe how the mesh communication on
the pyramid intermediate levels can be simulated on the CM
using the addressing scheme introduced in the previous
section. Note that simulating the mesh communication on the
base of the pyramid is performed directly using the NEWS
network of the CM. In what follows, we show how a send-east
operation within level one of the pyramid is simulated in the
CM, and then generalize this procedure to other pyramid
levels (communication in the other directions is performed
similarly).

To simulate mesh communication within level one, the
information is first sent forward along the third dimension.
Sending information forward along the nth dimension means
that only PE's whose nth bit is 0 send their information to the
PE's whose binary address is identical except that the nth bit
is 1. Thus, sending forward along the third dimension will
involve PE3 sending its information to PE7, but PE7 will not
send its information to PE3. When PE7 sends its information
to PE3, we will refer to that as sending the information
backward along the third dimension. We will use the
following notation to express these two types of hypercube
communication.

-->3  (send  information  forward 
along  the  third  dimension.) 

3<--  (send  information  backward 
along  the  third  dimension.) 

Thus sending information forward along the third dimension
will ensure that half the PE's on level one have sent their
information to their east neighbors (PE3 to PE7, PE19 to
PE23, etc.). Now for PE7 to send its information to PE19,
it first sends the information along the 5th dimension to PE23,
and then PE23 sends it back along the third dimension to PE19.
This results in half the remaining PE's sending their
information to their east neighbors. In our notation
this is represented as (-->5 followed by 3<--). To continue
sending the rest of the information east, a similar scheme is
followed. The complete procedure to send east in the first
level of the pyramid in a 16-dimension hypercube is given by
the following figure:


-->5, 3<--

-->7, 5<--, 3<--

-->9, 7<--, 5<--, 3<--

-->11, 9<--, 7<--, 5<--, 3<--

-->13, 11<--, 9<--, 7<--, 5<--, 3<--

-->15, 13<--, 11<--, 9<--, 7<--, 5<--, 3<--
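
The routes listed above amount to a binary increment of the PE's east-west index, whose bits sit on the odd hypercube dimensions. The following Python sketch (hypothetical function names; it assumes an even `cube_dims` and no wraparound) derives the route for a single PE rather than simulating the whole machine.

```python
def x_index(address: int, level: int, cube_dims: int = 16) -> int:
    """East-west index of a level-`level` PE, read from odd dimensions 2*level+1 .. n-1."""
    dims = range(2 * level + 1, cube_dims, 2)
    return sum(((address >> (d - 1)) & 1) << k for k, d in enumerate(dims))


def east_route(address: int, level: int, cube_dims: int = 16):
    """Return (steps, destination) for a send-east, or None on the east edge.

    One forward step along the lowest clear odd dimension is followed by
    backward steps along every lower odd dimension down to 2*level + 1.
    """
    for d in range(2 * level + 1, cube_dims, 2):
        if not address & (1 << (d - 1)):
            steps = [("forward", d)]
            steps += [("backward", b) for b in range(d - 2, 2 * level, -2)]
            destination = address
            for _, dim in steps:
                destination ^= 1 << (dim - 1)
            return steps, destination
    return None


print(east_route(3, 1))  # PE3 reaches PE7 with a single forward step
print(east_route(7, 1))  # PE7 reaches PE19 via PE23
```

The longest route on level one of a 16-dimension cube starts with a forward step along dimension 15 and is followed by six backward steps, which corresponds to the last row of the figure.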

Note that these communication steps can be pipelined in such
a way that there is no repetition of the same communication
steps. To do that, a temporary variable has to be created to
prevent overwriting a traveling value. For example, suppose we
want to send the value of variable A to variable B of the east
neighbor. In any send-forward operation, variable A is sent to
variable B, while in a send-backward operation, variable B is
sent to replace variable B in the receiving PE. The pipelined
version of the above procedure is as follows:

13<--, -->13
11<--, -->11
9<--, -->9
7<--, -->7
5<--, -->5
3<--, -->3

In general:

n-3<--, -->n-3
n-5<--, -->n-5
...
2i+1<--, -->2i+1

n: dimension of the hypercube, i: level number.

During each of these communication steps, all of the PE's are
actively sending or receiving information. Note also that the
above scheme can easily be augmented to perform
wraparound shifting as well. This is accomplished by a
sequence of send-backward communication instructions. For
example, in the 16-dimension hypercube this starts with 15<--,
where the contents of variable A are sent to variable B
backward along the 15th dimension, followed by a sequence of
send-backward operations from variable B to variable B in
the receiving PE (13<--, 11<--, ..., 3<--), except for the last
one, where the contents of variable B are sent to variable A
only in the receiving PE's.

In general, simulating send-east communication within level i
in the pyramid involves following the above procedure, except
that we stop when information has been sent forward along
dimension 2*i + 1. This means that all the intermediate mesh
communications in the pyramid can be performed
simultaneously by disabling the receiving PE's on any level
when they have completed receiving the required information.
This can be achieved by storing in each PE the pyramid level
number of the PE it simulates in the pyramid architecture.


The  execution  time  of  the  procedure  to  simulate  all  pyramid 
mesh  communications  is  then  proportional  to  the  number  of 
dimensions  in  the  hypercube. 
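
The simultaneous, pipelined east shift can be simulated sequentially to check the idea. The Python sketch below rests on several assumptions that are not spelled out in the text as extracted: the sequence is taken to open with a single forward send along dimension n-1 before the backward/forward pairs, `cube_dims` is even, there is no wraparound, and all function names are hypothetical. Each CM PE holds one A value and one B buffer.

```python
def level_of(address, cube_dims):
    """Intermediate pyramid level simulated by this CM PE, or None."""
    for level in range(1, cube_dims // 2 + 1):
        block = 4 ** level
        if address % block == block - 2 ** (level - 1):
            return level
    return None


def east_neighbor(address, level, cube_dims):
    """Address of the east neighbor on the same level, or None on the east edge."""
    dims = list(range(2 * level + 1, cube_dims, 2))
    x = sum(((address >> (d - 1)) & 1) << k for k, d in enumerate(dims))
    if x + 1 >= 1 << len(dims):
        return None
    result = address
    for k, d in enumerate(dims):
        bit = 1 << (d - 1)
        result = (result | bit) if ((x + 1) >> k) & 1 else (result & ~bit)
    return result


def shift_east_all_levels(a, cube_dims):
    """Shift A east into B on every intermediate level; returns (b, steps)."""
    size = 1 << cube_dims
    levels = [level_of(p, cube_dims) for p in range(size)]
    b = [None] * size
    steps = 0
    top = cube_dims - 1
    for d in range(top, 2, -2):  # odd dimensions n-1, n-3, ..., 3
        bit = 1 << (d - 1)
        # A level-i PE is disabled once d drops below 2*i + 1.
        active = [p for p in range(size)
                  if levels[p] is not None and 2 * levels[p] + 1 <= d]
        if d != top:
            # d<-- : B travels from PE's with bit d set to their partners.
            moved = {p ^ bit: b[p] for p in active if p & bit}
            for q, value in moved.items():
                b[q] = value
            steps += 1
        # -->d : A is sent into B of the partner whose bit d is set.
        moved = {p ^ bit: a[p] for p in active if not p & bit}
        for q, value in moved.items():
            b[q] = value
        steps += 1
    return b, steps


values = list(range(1 << 8))  # each PE's A value is its own address
shifted, steps = shift_east_all_levels(values, 8)
print(steps)
```

Under these assumptions the step count is one plus two per remaining odd dimension, so it grows linearly with the number of hypercube dimensions, in line with the statement above.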

Figure 4 shows the output from the implementation of this
scheme for a 5-level pyramid (16 x 16 CM). The
implementation has been performed on the CMLisp simulator
running on a Symbolics machine. In this simulation we assume
no wraparound in the shift operation. The communication
steps for the east shift operation are shown for levels 1 through 4.

4 Simulating Top/Down Pyramid Communication
on the CM

To simulate the pyramid top/down communication on the CM,
at most three hypercube communication cycles are required to
simulate the level-to-level communication in the pyramid. For
example, on the bottom level of the pyramid, going one level
up, each PE sends its message first forward along the first
dimension, then forward along the second dimension. For
example, PE3 is the parent of PE's 0, 1, 2, 3, and PE0 sends its
information to its parent node by sending it first to PE1, and
then to PE3. In our notation that consists of two steps, -->1
and -->2. Note that sending information from PE1 to its
parent involves sending its information forward along the
second dimension (-->2), and sending the information from
PE2 to its parent involves sending forward along the first
dimension (-->1). Similarly, going from level 1 to level 2
involves communication forward along dimensions 3 and 4,
respectively. Thus, PE3 sends its information to PE7 and then
PE15. PE14 is the parent node of PE's 3, 7, 11, 15. Thus, PE15
now sends the information to PE14 backward along the first
dimension.

Remember that in the pyramid architecture, each parent is
connected to four children on the level below it. We refer to
these four children as UL (upper left), UR (upper right),
LL (lower left), and LR (lower right). Thus, in general, to send
information from a UL child on level i to its parent on level
i+1, information is sent forward first along dimension 2*i + 1,
then forward along dimension 2*i + 2, and finally backward
along dimension i. Note that going along dimension 0 means
no operation. In our notation this is represented by the
following sequence:

--> 2*i + 1
--> 2*i + 2
i <--

Sending the information from child UR to its parent is
executed by the following sequence:

--> 2*i + 2
i <--

while sending information from child LL to its parent involves
the following sequence:

--> 2*i + 1
i <--

Sending information from the LR child to its parent involves
only one communication step (i <--). It is clear that sending
information from a single child to its parent executes in at
most 3 steps.
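
The four child-to-parent sequences can be folded into one routine that inspects the two bits that distinguish UL, UR, LL and LR. This Python sketch uses hypothetical names and assumes the address passed in really belongs to the stated level (any address for level 0).

```python
def parent_route(address: int, level: int):
    """Return (steps, parent_address) for a PE on `level` sending to level + 1.

    Forward steps are needed only along dimensions 2*level + 1 and
    2*level + 2 whose bits are still 0; the backward step along dimension
    `level` is skipped for level 0, where it would be a no-op.
    """
    steps = [("forward", d)
             for d in (2 * level + 1, 2 * level + 2)
             if not address & (1 << (d - 1))]
    if level >= 1:
        steps.append(("backward", level))
    parent = address
    for _, d in steps:
        parent ^= 1 << (d - 1)
    return steps, parent


print(parent_route(0, 0))   # UL child on the base
print(parent_route(3, 1))   # UL child on level 1
print(parent_route(15, 1))  # LR child on level 1
```

Summing the step counts for the four children of one parent on an intermediate level gives 3 + 2 + 2 + 1 = 8, which is where an eight-cycle bound for sending from all children comes from.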

Sending  information  from  all  UL  children  to  their  parents  is 
executed  by  the  following  sequence: 

--> 1, --> 2, 0 <--

--> 3, --> 4, 1 <--

--> 5, --> 6, 2 <--

...

--> n-1, --> n, (n-1)/2 <--

Note that the first step in all sequences can be performed
simultaneously in one hypercube cycle if each PE stores its
level number and uses it to compute the address of the
receiving PE. Thus, executing a global send-to-parent
communication takes at most 3 cycles if each PE has its
pyramid level number stored in it. Sending from all children
to their parents executes in at most 8 cycles. In conclusion,
simulating the parent/child communication executes in fixed
time regardless of the hypercube dimensions. The
parent/children communication has also been implemented on
the CMLisp simulator running on the Symbolics machine.

5  Pyramid  Programming  Environment 

The addressing scheme described in this paper has been used
in implementing a programming environment for pyramid
algorithms. This environment allows the user to create
pyramid data structures, load/unload images into various
pyramid levels, move data up/down, display pyramid data
structures on various levels, and perform several operations
such as convolution and hierarchical operators on the created
data structures. This programming environment has been
written in *LISP, a parallel programming language that has
been developed at Thinking Machines Inc. to program the CM.
The * prefix is used as a naming convention with functions to
indicate that these functions are going to be executed by the
CM hardware. Operators and data preceded by !! are parallel
operators and data that are performed in all PE's of the CM.
More details can be found in [9].

The  following  example  shows  how  to  use  this  environment  to 
find  edges  using  a  simple  multi-resolution  thresholding 
scheme.  Some  of  the  functions  used  are  *LISP  functions, 
while  those  that  perform  pyramid  operations  are  part  of  the 
programming  environment. 

(*defun *find-edges (pmd start-level thresh dest-pmd)

  ; The following two functions define
  ; pyramid data structure variables.
  (*defpmd 'parent-value)
  (*defpmd 'edge)

  ; The following function sets the pyramid
  ; variable on a specified level to
  ; a certain value.
  (*pmd-set-level start-level 'parent-value (!! thresh))

  ; "refine" is the thresholding function.
  ; It recursively goes down the pyramid
  ; from start-level to level 0. At each
  ; level, if the parent's value is above
  ; the threshold, it calculates the edge
  ; value for this pixel, and stores the
  ; value in the pyramid variable edge,
  ; which is passed afterwards to its
  ; children as the parent value.
  (refine pmd start-level thresh)

  ; The function *pmd-set sets the value
  ; of the pyramid variable in all levels
  ; equal to a certain value.
  (*all (*pmd-set dest-pmd edge))

  ; The following functions deallocate
  ; pyramid variables.
  (*deallocate-*defpmd parent-value)
  (*deallocate-*defpmd edge)
) ; end *find-edges

(*defun refine (pmd level thresh)
  (*all (*when (and!! (level!! level)
                      (>=!! (*pmd 'parent-value level)
                            (!! thresh)))
          (*pmd-set-level level 'edge
                          (edge-op!! pmd level))))
  (cond ((/= level 0)
         ; send-level-children is an example of
         ; one of the communication primitives. It
         ; sends the value of the parent on
         ; a certain level to its children.
         (*all (send-level-children!! level
                                      (*pmd 'edge level)
                                      (*pmd 'parent-value
                                            (- level 1))))
         (refine pmd (- level 1) thresh))
        (t 't))
) ; end refine

; The edge operator function uses the
; "shift-level!!" function to compute
; the edge function value.
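
For readers without access to the CM environment, the control flow of the listing can be approximated sequentially. The following Python sketch is not the authors' code: the pyramid is assumed to be a list of square images with level 0 first, untouched edge values are assumed to default to 0, and `east_difference` is a hypothetical stand-in for the unspecified edge operator.

```python
def east_difference(image, x, y):
    """Hypothetical edge operator: absolute difference with the east neighbor."""
    if x + 1 < len(image[y]):
        return abs(image[y][x + 1] - image[y][x])
    return 0


def find_edges(pyramid, start_level, thresh, edge_op):
    """Coarse-to-fine thresholding from start_level down to level 0.

    A pixel is examined only if the value handed down by its parent
    reaches the threshold; its edge value is then handed to its children.
    """
    edge = [[[0] * len(row) for row in image] for image in pyramid]
    size = len(pyramid[start_level])
    parent_value = [[thresh] * size for _ in range(size)]
    for level in range(start_level, -1, -1):
        image = pyramid[level]
        for y in range(len(image)):
            for x in range(len(image)):
                if parent_value[y][x] >= thresh:
                    edge[level][y][x] = edge_op(image, x, y)
        if level > 0:
            # Each child at (x, y) receives the edge value of its parent.
            child_size = len(pyramid[level - 1])
            parent_value = [[edge[level][y // 2][x // 2]
                             for x in range(child_size)]
                            for y in range(child_size)]
    return edge


base = [[0, 0, 9, 9] for _ in range(4)]   # vertical step edge
coarse = [[0, 9], [0, 9]]                 # 2 x 2 block averages of the base
edges = find_edges([base, coarse], 1, 5, east_difference)
print(edges[0])
```

In this small example only the children of the coarse pixels that responded to the step are examined at full resolution, which is the saving the multi-resolution scheme is after.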


6  Conclusion  and  Future  Work 

In this paper, we have described a mapping scheme that
efficiently maps pyramid communication onto the CM. It has
been shown that inter-level communication (top/down) is
performed in 3 hypercube cycles. Also, mesh (intra-level)
communication in all levels can be performed in a number of
hypercube cycles that is proportional to the number of
pyramid levels. A programming environment based on this
scheme has been introduced briefly. Currently we are
implementing many of the pyramid algorithms using this
environment. These algorithms include parallel texture and
stereo algorithms.

Figure 1: The Pyramid Architecture

Figure 3: The layout of the pyramid intermediate levels. The original CM mesh represents level 0.

Figure 4: East Shift Operation. Shifting East Random Values for non-leaf levels (A --> B)

Abstract


An approach to 2D model-based object recognition is developed,
suitable for implementation on a highly parallel SIMD computer.
Object models and image data are represented as contour features.
Transformation sampling is used to determine the optimal model
feature to image feature transformation by sampling the space of
possible transformations. It is shown that only a small part of this
space need actually be sampled due to the constraints placed on
possible transformations by individual matches of image features to
model features. The procedure requires $O(Tmn)$ processors and
$O(\lg^2(Tmn))$ time, where m is the number of model features,
n is the number of image features, and T depends on the size of
the image. The resulting procedure works well and is extremely
robust in the presence of occlusion. An implementation of the
procedure on the Connection Machine is described, and some
experimental results are given.


1  Introduction 


We are interested in understanding how to perform model-based
object recognition on a parallel computer. The primary motivation
is that it may be possible to achieve extremely fast model-based
recognition by exploiting parallel computation. The domain we
consider in this paper is that of 2D object recognition
from digitized images, based on object models. Figure 1 shows
an example of the kind of problem we expect to solve. Here we
have a typical input image of a scene including an instance of the
modeled object and several unknown objects, and the output of
the system indicating recognition of the object. The modeled
object is shown in figure 9. This is the actual input and output
of the recognition procedure as implemented on the Connection
Machine parallel computer [13].

We have a well-known problem with a new constraint: develop
a robust model-based recognition procedure that can be
implemented efficiently on a parallel machine. The model-based
recognition task is to find the pose of a known object in the image,
which is equivalent to determining a transformation which
aligns the model with its instance in the image. This is difficult
for two major reasons. First, in a feature-based method, such
as we develop here, it is difficult to determine the correct
correspondences between model and image features since there are an
exponential number of mappings from image features to model
features. This is further complicated when some model features
are absent from the image due to occlusion, and when there are
extra image features due to the presence of unknown objects.
Second, even given a correct matching of image features to model
features, there is probably no unique transformation aligning the
model with its instance in the image, since there is uncertainty as
to the exact location of image features due to sensing difficulties,
and image features may be only fragments of their
corresponding model features. These two problems conspire to
make determining a valid transformation difficult. The technique
that we shall develop here is based upon discretely sampling the
space of possible transformations to determine the transformation
best aligning the model with its instance in the image. As
we will see, only a small fraction of the entire space need be
sampled. This leads to a robust and fast parallel algorithm.

The difficulties of the recognition process itself are well known.
A significant amount of work has been devoted to 2D model-based
object recognition, and several good systems have been
developed for serial computers [5][8][6]. How might we do this
in parallel, and why is developing a parallel procedure difficult?
One approach is to convert an existing serial algorithm to a parallel
one. However, it may not be possible to take a serial algorithm
and modify it to work on a parallel architecture, taking
full advantage of the available processing power. Recall that a
problem of size M requiring $\Omega(f(M))$ time on a single processor
machine requires time $\Omega(f(M)/N)$ on a parallel machine with
$O(N)$ processors. So at best we get a speedup by a factor of
N. In practice, however, we often cannot achieve this speedup,
depending on the nature of the problem. As a point of interest,
there are problems solvable in polynomial time on a single
processor that, it is conjectured, cannot be solved in $O(\lg^k M)$
time for any k on a polynomial sized parallel machine [14]. So
while a good parallel algorithm can be translated into a good
serial algorithm, running a factor of $O(N)$ slower, by straightforward
simulation of the parallel machine, it may not be possible
to take a serial algorithm and achieve a speedup of N by parallel
implementation.

As just mentioned, we seek the transformation aligning the
model with the object in the image. We introduce the technique
of transformation sampling. Simply stated, the 3D space of all
possible transformations of the model is sampled on a 3D grid,
where a point in this space represents a particular transformation
consisting of a rotation, $\phi$, about the origin and a translation, u
and v, in position. This is a 3D space since we assume that scale
is known. As we will see, it is not necessary to sample the entire
space, but only regions corresponding to transformations which
will keep some match of a model feature and an image feature
near each other after transforming the model feature. Indeed,
it is easy to see that there are many transformations which will
result in none of the model features being near any image feature.
These transformations can be safely ignored. This procedure is
easily implemented on a parallel machine, resulting in a fast and
quite robust 2D recognition procedure.
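
To make the transformation space concrete, the following Python sketch applies one sample (phi, u, v) to a model feature and tests whether a given model/image match stays close under it. This is only an illustration under stated assumptions: the rotation is taken to be applied before the translation, a feature is reduced to a position plus an orientation angle, and the function names and the tolerance `tol` are hypothetical rather than taken from the paper.

```python
import math


def transform_feature(x, y, theta, phi, u, v):
    """Rotate a feature by phi about the origin, then translate it by (u, v).

    Returns the new position and the new orientation (radians, in [0, 2*pi)).
    """
    c, s = math.cos(phi), math.sin(phi)
    return (c * x - s * y + u, s * x + c * y + v, (theta + phi) % (2 * math.pi))


def keeps_match_close(model, image, phi, u, v, tol):
    """True if the transformed model feature lands within `tol` of the image feature.

    Only position is compared here; `model` and `image` are (x, y, theta) tuples.
    """
    mx, my, _ = transform_feature(*model, phi, u, v)
    return math.hypot(mx - image[0], my - image[1]) <= tol


model_feature = (1.0, 0.0, 0.0)
image_feature = (2.0, 4.0, math.pi / 2)
print(keeps_match_close(model_feature, image_feature, math.pi / 2, 2.0, 3.0, 0.5))
```

A sampler built on this test would evaluate only those grid cells for which at least one model/image pair passes, and skip the rest of the space.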

The transformation sampling technique bears resemblance to
two existing recognition tools: clustering, such as the Hough
transform [10][11][7], and global measure optimization. As we
will show, transformation sampling does not have several of the
characteristic shortcomings of Hough transform methods, and is
efficiently implemented in parallel.

The contribution of this work is the development of the technique
of transformation sampling, the description of a parallel
algorithm for 2D model-based recognition, and the implementation
and demonstration of the procedure. Transformation sampling
is key to the procedure, and is not subject to some of the problems
with commonly used recognition techniques like the Hough
transform. We will show that this procedure requires $O(Tmn)$
processors and $O(\lg^2(Tmn))$ time, where T depends on the size
of the image, and m and n are the number of model and image
features respectively. Given the assumption that the image size
is constant, we can consider T constant.

Figure 1: An example of an input image and the recognized
object. The modeled object is shown in figure 9.

We employed two very general criteria to guide development
of this parallel recognition procedure: (1) The procedure must
be robust in the presence of sensing errors, as well as the presence
of several occluding objects. (2) The procedure should be
implementable on a parallel computer so that it runs fast on a
parallel machine, say $O(\lg^k M)$ time on $O(M^j)$ processors, where
M characterizes the size of the problem. In short, it must work
and run fast on a realizable machine.

2  Background 

Before beginning the specific description of the recognition
algorithm we need to briefly introduce some background material.
This section provides some of the background concepts we will
rely on as we develop the recognition algorithm. Here we describe
the representation of models and image data; we formalize the
notion of transformation of the model, and we discuss the concept
of transform parameter space. This facilitates the formal
definition of the recognition problem.


2.1  Models  and  Image  Data 

2.1.1  Features 

The  modeled  objects,  as  well  as  the  image  data,  are  two- 
dimensional.  Objects  are  represented  by  their  boundary  con¬ 
tours,  which  in  turn  are  represented  by  a  set  of  contour  features. 
The  contour  features  are  characterized  by  a  unique  position  and 
orientation.  We  will  consider  two  types  of  features,  point  features 
and extended features. Point features simply consist of the
orientation of the contour normal at a specific position, and are
used primarily for the theoretical development of the algorithm.
Extended features are simply straight line segments approximating
sections of the contour, and are actually used in the experimental
implementation. Both the stored model of an object and the
objects in the input image are represented by these features.

2.1.2  Uncertainty 

We assume that noise in the image, spatial distortions in the
imaging device, edge detection, as well as the polygonal
approximation of the contour result in uncertainty in the position
and orientation of each image feature. This means that the measured
position and orientation of a feature in the image may differ from
the actual position and orientation of the feature in the scene.
We assume that this uncertainty is bounded, and let $R$ denote
the upper bound on the uncertainty of the position of an image
feature, and $\Theta$ denote the upper bound on the orientation
uncertainty of an image feature. Thus, considering a point feature
for example, the true position in the scene of an image feature
falls within a circle of radius $R$ centered at the measured
position, and the true orientation of the feature is within
$\Theta$ radians of the measured orientation. This follows closely
the representation used by Grimson and Lozano-Perez[8].

2.2  Transformations 

Given  a  model  of  an  object  represented  as  a  set  of  features,  and 
a  set  of  data  features  representing  the  input  image,  if  the  scene 
contains  the  modeled  object  there  exists  a  transformation  on  the 
model which will align it with its instance in the image. If we
let $M_\phi$ represent a 2D rotation matrix, $d$ represent the
position of an image feature, and $m$ the position of its
corresponding model feature, then $d = M_\phi m + t$ for some
$\phi$ and $t$, where $\phi$ represents the rotation of the model
feature and $t = [u\ v]$ is the translation in the x and y
directions. Thus a transformation is parameterized by
$(\phi, u, v)$ and consists of a rotation about some origin,
followed by a translation.
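
As a minimal sketch of the transformation just defined, the following Python fragment applies a rotation by phi about the origin followed by a translation (u, v) to a model feature position. The function name `transform` and the tuple representation of positions are assumptions made for illustration; they are not taken from the implementation described in this paper.

```python
import math

def transform(m, phi, u, v):
    """Rotate the model position m = (x, y) by phi radians about the
    origin, then translate it by (u, v). (Hypothetical helper.)"""
    x, y = m
    c, s = math.cos(phi), math.sin(phi)
    return (c * x - s * y + u, s * x + c * y + v)

# The model point (1, 0), rotated by 90 degrees and translated by (2, 3).
d = transform((1.0, 0.0), math.pi / 2, 2.0, 3.0)
```

Here the rotation carries (1, 0) to (0, 1), and the translation then moves it to approximately (2, 4).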

When  speaking  of  a  transformation,  we  mean  a  relative  trans¬ 
formation  between  a  particular  model  feature  and  image  feature. 
The  term  feature  match  will  be  used  extensively  to  describe  a 
particular  pair  of  a  model  feature  and  an  image  feature. 

Let $\{m_i\}$ describe the set of model features, and $\{d_j\}$
describe the set of image features. Then $\langle m_i, d_j \rangle$
describes a particular feature match and
$S = \{\langle m_i, d_j \rangle\} = \{m_i\} \times \{d_j\}$ is the
set of all possible feature matches. We say that a subset
$S' \subseteq S$ is a mapping or a matching between model features
and data features. Further, let $m = |\{m_i\}|$, and
$n = |\{d_j\}|$.

2.3  Recognition 

Since there is uncertainty as to the actual position and
orientation of an image feature, when considering a particular
feature match, any transformation which brings the model feature to
within the uncertainty bounds on position and orientation of the
image feature could be the correct transformation. We say that
in this case the features align to within the uncertainty bounds.
We say a set of feature matches, $S'$, is globally consistent if
there exists some transformation on the model features which
simultaneously aligns all the matches in $S'$.

Recognition  consists  of  determining  the  presence  of  an  object 
in  the  image,  and  of  finding  the  pose  of  the  object:  position 
and  orientation.  Of  course,  the  position  and  orientation  of  an 
object  in  the  image  are  directly  related  to  the  transformation 
aligning the model with its instance in the image. In the next
section we will associate a function $F(\phi, u, v)$ with each
transformation. We will define recognition as finding a
transformation producing an optimal value of $F(\phi, u, v)$. By
appropriate definition of $F(\phi, u, v)$, recognition consists of
finding large sets
$S' \subseteq \{\langle m_i, d_j \rangle\} = \{m_i\} \times \{d_j\}$
that are globally consistent, and the transformations at which the
matches of $S'$ align. Note that if an image feature and a model
feature are incorrectly matched, but still align within the bounds
for some transformation, then there is no way to distinguish them
from correctly matched features, and they will be considered as
such.
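
The notions of alignment and global consistency can be illustrated for point features with a short Python sketch. The feature representation (x, y, orientation), the helper names, and the symbol `Theta` for the orientation bound are assumptions introduced here for illustration only. Note that the sketch tests consistency at one given transformation; global consistency asks whether some such transformation exists.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def aligns(model, image, phi, u, v, R, Theta):
    """True if the transformed model feature lies within the position
    bound R and the orientation bound Theta of the image feature."""
    mx, my, mo = model
    dx, dy, do = image
    c, s = math.cos(phi), math.sin(phi)
    tx, ty = c * mx - s * my + u, s * mx + c * my + v
    return (math.hypot(tx - dx, ty - dy) <= R
            and angle_diff(mo + phi, do) <= Theta)

def consistent_at(matches, phi, u, v, R, Theta):
    """True if every match in the set aligns under this one transformation."""
    return all(aligns(m, d, phi, u, v, R, Theta) for m, d in matches)

# Two matches whose image features are slightly perturbed copies of the
# model features translated by (5, 5).
matches = [((0.0, 0.0, 0.0), (5.1, 5.0, 0.02)),
           ((1.0, 0.0, 0.0), (6.0, 5.1, -0.03))]
loose = consistent_at(matches, 0.0, 5.0, 5.0, R=0.2, Theta=0.05)
tight = consistent_at(matches, 0.0, 5.0, 5.0, R=0.05, Theta=0.05)
```

With the looser position bound both matches align at the transformation (0, 5, 5); with the tighter bound they do not.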

2.4  Transformation  Parameter  Space 

For each match there is a range of rotations which will bring
the orientation of the model feature to within the orientation
uncertainty, $\Theta$, of the image feature's orientation. For each
of these rotations of the model feature, after performing it, there
is a range of translations which will bring the position of the
model feature to within the position uncertainty, $R$, of the image
feature. Each transformation can be represented as a point
$(\phi, u, v)$ in the transformation parameter space
$TP = \mathbb{R}^2 \times SO_2$ where $SO_2$ is the 2D rotation
group, $\phi$ the rotation and $u$ and $v$ the translation in the
x and y directions respectively. Thus, the set of valid
transformations which leave the model feature aligned within the
uncertainty bounds with the image feature form a region in
parameter space we call a match region.

3  Recognition  as  Sampling  Parameter 
Space 

By appropriate choice of the function $F(\phi, u, v)$ characterizing
transformations, recognition will consist of finding a
transformation which aligns a large number of feature matches. This
corresponds to finding the regions in the transform parameter space,
TP,  which  intersect  a  large  number  of  match  regions  when  the 
match  regions  for  all  possible  feature  matches  are  constructed  in 
TP.  Such  a  region  defines  a  mapping  between  model  and  image 
features,  and  provides  a  range  of  transformations  which  aligns 
each  match  in  the  mapping. 

3.1  The  Structure  of  Parameter  Space 

Consider each match $\langle m_i, d_j \rangle$ as defining a scalar
field in TP, which has value $f_{ij}(\phi, u, v) = c_{ij} > 0$ for
$(\phi, u, v)$ within the match region and 0 outside it, where
$c_{ij}$ is a constant depending on the match
$\langle m_i, d_j \rangle$. Let the value of the field resulting
from the superposition of the $f_{ij}$ in TP be given by
$F(\phi, u, v) = \sum_{ij} f_{ij}(\phi, u, v)$. In this particular
case $F(\phi, u, v)$ is a piecewise constant function which changes
value only at the boundary of a match region. Call these piecewise
constant regions intersection volumes. In general, we could define
$f_{ij}(\phi, u, v)$ as any function quantifying properties of its
associated match $\langle m_i, d_j \rangle$; similarly, the value
of $F(\phi, u, v)$ could be any function of the $f_{ij}$.
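
The superposition of the per-match fields can be sketched in a few lines of Python. For brevity the match region used here ignores rotation and tests translation only; that simplification, the function names, and the default weight of 1 for every match are assumptions of this sketch rather than part of the formulation above.

```python
import math

def in_match_region(match, sample, R):
    """Toy match region: the rotation phi is ignored (an assumption made
    for brevity), so the region is the set of translations that bring
    the model position within R of the image position."""
    (mx, my), (dx, dy) = match
    _phi, u, v = sample
    return math.hypot(mx + u - dx, my + v - dy) <= R

def F(sample, matches, R, weight=lambda match: 1.0):
    """Superpose the per-match fields: add weight(match) for every match
    whose match region contains the sample point."""
    return sum(weight(mt) for mt in matches if in_match_region(mt, sample, R))

matches = [((0, 0), (1, 1)), ((2, 0), (3, 1.1)), ((0, 2), (5, 5))]
value = F((0.0, 1.0, 1.0), matches, R=0.2)
```

At the sample point (0, 1, 1) the first two matches fall inside their match regions and the third does not, so the field takes the value 2.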

3.2  Finding  the  Optimal  Transformation 

Let an optimal transformation be defined as a transformation
with a maximal value of $F(\phi, u, v)$. Finding the optimal
transformation amounts to finding the intersection volume with an
optimal value of $F(\phi, u, v)$. For convenience we will call the
value of $F(\phi, u, v)$ in an intersection volume the value of
that intersection volume. In fact, all we really need to do is find
a sample point in parameter space which falls in this optimal
intersection volume. Such a point defines the same mapping between
model and image features as the entire containing intersection
volume, and provides a globally consistent transformation.

3.3  Transformation  Sampling 

Part of the difficulty in determining a transformation lies in
determining which model features and image features correspond.
Some existing recognition systems approach the problem by first
determining a set of corresponding feature matches, and then
determining a transformation from these matches[8][5], but finding
this matching is difficult. We accomplish both in one step, by
finding a transformation that aligns a large fraction of the model
with its instance in the image.

We have recast the recognition problem as the problem of finding
regions of TP, specifically intersection volumes, which have
an optimal value of the function $F(\phi, u, v)$. One approach to
this is to simply sample points $(\phi, u, v)$ in TP to find one
with an optimal value. An alternative perspective on this idea is
that we pick a particular transformation and ask each match in
parallel, "Does this fall in your match region?", for all possible
matches. When we pick a transformation near the correct one the
answer will be "Yes" for many matches. This is just what
transformation sampling does.

A systematic approach would be to sample the entire parameter
space on a regular grid, to find some such sample points which
fall inside optimal intersection volumes. This is a finite but
extremely large set of sample points. However, we can reduce the
number of sample points considerably by noting that for a
particular feature match we need only consider those sample points
which actually fall inside its match region. Thus, assuming that
an average of $T$ sample points fall inside each match region, we
need only consider $O(Tmn)$ transformation sample points, where
$mn$ is the number of feature matches.


4  The  Recognition  Algorithm 

In this section we briefly outline the steps of the algorithm, and
analyze  its  complexity.  In  a  later  section  we  will  develop  a  more 
formal  theoretical  foundation  for  the  algorithm  and  look  at  it  in 
the  context  of  parallel  implementation. 

4.1  The  Algorithm 

We will consider features consisting of straight line segments
approximating the boundary contours. In its basic form the
algorithm consists of the following:

•  Extract image features

•  Form all matches $\langle m, d \rangle$ of model and image features

•  Compute the range of valid relative rotations for each match

•  Determine the set of rotation sample points in this range of
rotations

•  For each rotation sample point, compute the range of valid
relative translations

•  For each rotation sample point, determine the set of translation
sample points in this range of translations

•  For each complete parameter space sample point thus
constructed, compute the value of $F(\phi, u, v)$ at the point

•  Select the transformation sample points with optimal values of
$F(\phi, u, v)$

•  Determine the object's position in the image from the optimal
transformation sample points
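
The steps above can be sketched sequentially in Python. This is only an illustration under stated assumptions: it uses point features (x, y, orientation) rather than the line segments of the actual implementation, takes every match weight to be 1, assumes the rotation interval `d_phi` divides 2*pi evenly, and runs serially rather than in parallel. All names are hypothetical.

```python
import math
from collections import Counter

def recognize(model, image, R, Theta, d_phi, d_t):
    """Return ((phi, u, v), score) for the best grid sample, or (None, 0).
    Only samples inside some match region are ever visited."""
    n_phi = round(2 * math.pi / d_phi)
    votes = Counter()  # grid index (k, a, b) -> number of aligned matches
    for mx, my, mo in model:
        for dx, dy, do in image:
            # Rotation samples within Theta of the relative orientation.
            rel = do - mo
            k_lo = math.ceil((rel - Theta) / d_phi)
            k_hi = math.floor((rel + Theta) / d_phi)
            for k in range(k_lo, k_hi + 1):
                phi = k * d_phi
                c, s = math.cos(phi), math.sin(phi)
                # Translation mapping the rotated model point onto d.
                tu = dx - (c * mx - s * my)
                tv = dy - (s * mx + c * my)
                # Translation samples within R of (tu, tv).
                for a in range(math.ceil((tu - R) / d_t),
                               math.floor((tu + R) / d_t) + 1):
                    for b in range(math.ceil((tv - R) / d_t),
                                   math.floor((tv + R) / d_t) + 1):
                        if math.hypot(a * d_t - tu, b * d_t - tv) <= R:
                            votes[(k % n_phi, a, b)] += 1
    if not votes:
        return None, 0
    (k, a, b), score = votes.most_common(1)[0]
    return (k * d_phi, a * d_t, b * d_t), score

# Model of three features; the image holds the model rotated by 90
# degrees and translated by (3, 1), plus one unrelated clutter feature.
model = [(0.0, 0.0, 0.0), (2.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
half_pi = math.pi / 2
image = [(3.0, 1.0, half_pi), (3.0, 3.0, half_pi + 1.0),
         (2.0, 1.0, half_pi + 2.0), (7.0, 7.0, 0.3)]
pose, score = recognize(model, image, R=0.3, Theta=0.1,
                        d_phi=math.pi / 18, d_t=0.5)
```

In this example the three correct matches all vote for the sample at a rotation of 90 degrees and translation (3, 1), while no sample collects more than two votes from incorrect matches.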

4.2  Complexity  of  the  Algorithm 

We will not consider the feature extraction phase of the
algorithm. The complexity of the matching phase depends on how
many transformation sample points we consider for each match,
and on how many matches we have. Since for a particular match
we are only interested in the transformations within its match
region, how many transformation sample points does the match
region enclose? We sample transformation parameter space on a
regular grid in the 3 dimensions $\phi$, $u$, and $v$. Let
$\delta\phi$ radians be the sampling interval in the rotational
dimension, and $\delta t$ be the sampling interval in each of the
translational dimensions. The important point to note is that only
the sample points which fall within the match region for some
feature match need ever be considered. For a particular match the
size of the match region depends on three things: $R$ and $\Theta$,
the uncertainty in position and orientation respectively, and the
difference in length between the model feature and the image
feature. Thus, there are
$r = \lfloor 2\Theta/\delta\phi \rfloor$ rotation sample points to
consider. See figure 2.

Considering again this particular feature match, for each valid
rotation sample point, after the model segment has been rotated
according to the specified rotation sample point, there will be
a set of transformation sample points which fall in the range of
translations, $u$ and $v$, which bring the matched features within
a distance of $R$. We call this set of sample points a translation
neighborhood. For a given rotation sample point, the width of the
translation neighborhood in sample points, $w$, is given
approximately by $w = \lfloor 2R/\delta t \rfloor$. The length of
this region is a function of the difference in length between the
image and model segments. This is because the image segment, if
shorter, can slide along the length of the model segment and still
be within $R$. Let $D_{ij}$ denote the difference in length between
matched features $\langle m_i, d_j \rangle$. For a given rotation,
the length of the translation neighborhood in sample points is
given approximately by
$l_{ij} = \lfloor D_{ij}/\delta t \rfloor$. So for match $i, j$,
the number of discrete transformation sample points considered is
$T_{ij} = r w l_{ij}$.

Suppose we have an image of a given size. If we consider many
different models and data sets, we could determine reasonable
average case values for $l_{ij}$. To simplify the analysis, let $l$
be the maximum over many model and data sets, of the average of
$l_{ij}$ over the matches for each pairing of image and model
feature sets. Defining $T = rwl$, the total number of different
transformations that need be considered is $O(Tmn)$.
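
A small worked calculation may make the count concrete. The sketch below assumes that the rotation range has width 2*Theta, that the translation neighborhood has width 2R and length D, and that each is divided by the corresponding sampling interval; these expressions, the guard against zero counts, and all numerical values are assumptions chosen purely for illustration.

```python
import math

def samples_per_match(R, Theta, D, d_phi, d_t):
    """Approximate number of grid samples inside one match region,
    assuming r = floor(2*Theta/d_phi), w = floor(2*R/d_t) and
    l = floor(D/d_t), each taken to be at least 1."""
    r = max(1, math.floor(2 * Theta / d_phi))  # rotation samples
    w = max(1, math.floor(2 * R / d_t))        # neighborhood width
    l = max(1, math.floor(D / d_t))            # neighborhood length
    return r * w * l

# Illustrative values only: 10 model features and 50 image features.
T = samples_per_match(R=2.0, Theta=0.125, D=20.0, d_phi=0.0625, d_t=1.0)
total = T * 10 * 50
```

With these values there are 4 rotation samples, a neighborhood 4 samples wide and 20 long, giving 320 samples per match and 160,000 samples over all 500 matches.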


Figure 2: An illustration of the uncertainty bound $\Theta$, and the
set of possible rotation samples for a particular match.

Can we really consider the number of sample points per match
region as a constant? The number of sample points, $T$, depends
on the sampling intervals $\delta\phi$ and $\delta t$, and on the
difference in length between the matched features. The size of the
image provides an upper bound on the difference in length between
matched features, so this only depends on the size of the image,
$I$. It remains to be shown in section 6 that $\delta\phi$ and
$\delta t$ are independent of $m$, $n$, and $I$. In this case $T$
depends only on $I$, which might reasonably be considered constant.

5  Related  Work 

5.1  Transformation Sampling is not a Hough Transform

Clustering has been used as a complete recognition technique;
for example, see Stockman [7]. Transformation sampling is not
simply a Hough transform or similar clustering technique. There
are some key characteristics of the transformation sampling
technique which make it much more effective than clustering
techniques. Here we will argue that while the technique is similar
to Hough transform techniques, there are some important
differences.


5.1.1  Problems With Clustering as a Recognition Technique

The basic idea behind clustering techniques for recognition can
be easily described procedurally. All pairs of model and image
features are formed. For each such match, there is a relative
transformation on the model feature, rotation and translation,
that will align the matched features. This is computed, and a
point is placed in the 3D transformation parameter space
corresponding to this transformation. The idea is simply that
correctly matched features will yield approximately the same
relative transformation, and thus a cluster in parameter space will
be formed. Good candidates for the correct transformation are
found by searching parameter space for such a cluster. In
principle, this is an excellent approach. In the ideal case there
would be a sharp peak in parameter space corresponding to the
correct transformation. In practice, there are two main factors
which make correctly identifying the correct transformation
difficult: error in the measurement of the pose of image features,
and spurious points in parameter space due to incorrect matches.

First we consider the effect of uncertainty in the pose of the
image feature. There is a strong coupling between the rotation
component and the translation component of the transformation
aligning a model feature with an image feature. An error
in the estimate of the orientation of an image feature results in
an incorrect relative rotation, which in turn results in an
incorrect relative translation. This has the effect of spreading
out the cluster in space, so that although features are correctly
matched, they do not all correspond to the same point in parameter
space. If the feature is away from the center of rotation, it is
easy to see that uncorrelated error in the orientation of several
image features will lead to a helical shaped distribution of points
in parameter space, for correctly matched features. There is
another factor which contributes to the dilution of clusters: image
feature fragmentation. When an image feature is only part of its
corresponding model feature, there is no way a priori to determine
a unique translation aligning the two features. Thus there
is a region of uncertainty in which the point could fall in
parameter space, further spreading out the cluster. One solution to
the problem of spread out clusters is to smooth the parameter
space. The idea is to represent each match in parameter space
wherever it could fall, according to the pose uncertainty. It is
difficult to smooth the space correctly, which results in matches
falling in parameter space where they should not fall, and not
falling where they should.

Second, we consider the points in parameter space contributed
by incorrectly matched features. These points contribute to what
might be called the overall background noise of the parameter
space. As the level of this background noise increases, that is,
as the number of points in parameter space due to incorrect
matches increases, peaks or clusters in parameter space become
harder to distinguish. The presence of unknown objects in the
image results in more pairs of image and model features, and more
points contributed to parameter space, which contributes greatly
to the background noise.

These clustering techniques are commonly implemented by
tessellating parameter space into bins. All points falling within
the same cell of this tessellation are considered members of the
same bin. To find clusters we simply find the bin with the most
points in it. To handle the effect of uncertainty in the pose of
the image feature, the cells must be made large enough to include
most of a cluster, and thus the size of the bins depends on the
bounds on pose uncertainty. Alternatively, the tessellation is made
smaller and the space is smoothed as above.

The bin with the most matches falling in it may not be
associated with the optimal transformation, since it is quite
possible that no single transformation will simultaneously align
all matches in the bin. We can see this by noting that the regions
of valid transformations for two different matches may intersect
the same bin, but not intersect one another, creating a false
peak in the tessellated parameter space. See figure 3. This is one
reason why the clustering technique alone is not adequate for
recognition. Grimson and Lozano-Perez[8] simply use the Hough
transform to roughly order the matches they will consider in a
subsequent constraint based search procedure.

In summary, error in image feature pose, and fragmentation
spread out the clusters in parameter space. Incorrect matches
add to the background noise, watering down the relative strength
of peaks in the space, and leading to false peaks. And finally, all
features in a bin are not necessarily globally consistent.
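
The false peak described above can be reproduced with a small numerical example. The sketch below works in a single slice of parameter space at fixed rotation, uses circular regions of valid translations, and votes each match into the bin containing the centre of its region; the bin size, circle positions, and sampling step are hypothetical values chosen for illustration.

```python
import math
from collections import Counter

R = 1.0
centres = [(0.5, 0.5), (3.5, 3.5)]  # centres of two regions of valid translations
bin_size = 4.0

# Binned clustering: each match votes once, in the bin holding its centre.
bins = Counter((math.floor(cu / bin_size), math.floor(cv / bin_size))
               for cu, cv in centres)
hough_peak = max(bins.values())

def covered(p):
    """Number of regions of valid translations that contain the point p."""
    return sum(math.hypot(p[0] - cu, p[1] - cv) <= R for cu, cv in centres)

# Sampling: evaluate the overlap count at grid points covering the bin.
step = 0.25
samples = [(i * step, j * step) for i in range(17) for j in range(17)]
sampling_peak = max(covered(p) for p in samples)
```

Both matches land in the same bin, so the bin count is 2, yet the two circles are disjoint and no sample point lies in more than one of them.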


Figure  3:  The  squares  represent  slices  of  Hough  bins,  the  circles 
represent  regions  of  valid  transformations  for  matches.  Although 
a  Hough  bin  may  have  several  members,  they  are  not  necessarily 
mutually  consistent. 

5.1.2  The  Advantages  of  Transformation  Sampling 

In contrast to clustering techniques, transformation sampling
explicitly accounts for pose uncertainty and feature fragmentation.

Transformation sampling can be related to the Hough transform
by considering the size of the cells to be infinitesimal, and
the spacing of the cells in each dimension to be the
transformation sampling intervals. Following the Hough transform
analogy further, after each point is placed in parameter space it
is smeared out so that it falls in all the bins that it possibly
could given the pose uncertainty bounds. Thus every match is
represented at all sample points at which it possibly falls in
parameter space, and no more. Note that since it is as though the
bin size were infinitesimal, all matches that fall in a particular
bin are globally consistent with one another.

We noted that in a standard Hough transform technique, the
presence of unknown objects in the image contributes to the
background noise in the parameter space, and thus reduces the
accuracy of the clustering technique. In the case of transformation
sampling this background noise due to incorrect matches makes
absolutely no difference in accuracy, but simply increases the
complexity of the procedure since there are more matches of model
features to image features to consider. Noise makes no difference
since if an incorrect match is valid at a particular
transformation sample point, then to the best of our knowledge it
is a correct match.

Transformation sampling easily accounts for any amount of
error in the image feature pose, by simply considering a larger
number of possible transformation sample points. This mechanism
also handles problems with image feature fragmentation since the
fragment will be considered at all points along the model feature
matched to it. Thus we are able to use all image data to the full
extent, without limiting matches to be composed of features of
essentially the same length.

5.1.3  False  Positives  versus  False  Negatives 

We can characterize the main differences between transformation
sampling and Hough transforms by noting that transformation
sampling in a sense trades false negatives for false positives. We
have argued that the Hough transform may indicate a popular
transformation which in fact is not globally consistent. On the
other hand, transformation sampling will never indicate anything
but a globally consistent transformation. However, since we are
sampling parameter space, it is possible that we could miss a
peak in parameter space, and thus miss a transformation which
the Hough transform may have found. It must be shown that
given fine enough sampling intervals this will not happen. In
section 6 we will examine this more fully.

5.2  Constrained  Search  Based  Recognition 

We have not considered in detail the use of a constrained search
based recognition technique for parallel recognition because this
formulation may not work well in parallel, and because it does
not exploit all the available constraint. The systems by Grimson
and Lozano-Perez[8], and by Bolles and Cain[5] both rely
on constrained search based upon the pairwise constraints
between feature matches. The idea is to hypothesize a matching
between model and image features from which a transformation
can be derived. This is accomplished by finding a large set of
pairwise consistent matches. We can see that this is a difficult
problem by considering the feature matches as nodes in a graph,
where there is an edge between nodes if the matches are pairwise
consistent. This is simply the maximum clique problem on
a graph. This problem is NP-Complete[12], but in practice this
works well if the matches are considered in carefully chosen
subsets to reduce complexity. However, it seems that this is not
the best approach to consider for a parallel implementation for
two reasons. First, it seems the sequential nature of the search
process is what makes the constraints effective at pruning the
search space. It is clear that we cannot simply explore the entire
search space in parallel since it is enormous. Second, the
pairwise consistency formulation does not exploit all the
constraint available, since it looks for sets of pairwise
consistent matches, when in fact any correct matching we construct
must be globally consistent, that is, all matches must align under
the same transform. The transformation sampling technique, on the
other hand, is easily implemented in parallel, and explicitly
relies on the global consistency constraint. Thus, transformation
sampling seems a more attractive approach to parallel recognition.

6  A Formal Basis for Transformation
Sampling

One of the potential problems with transformation sampling is
that an optimal transformation will be missed because the
sampling intervals are too large. Due to the particular
configuration of the measurement errors for the set of image
features, it is possible that the optimal intersection volume, the
place where most match regions overlap, is so small that no sample
point falls inside it, thus missing it. In this section we
illustrate that although we may miss the point of greatest overlap,
it is very likely that we will sample a point near it in parameter
space, where many of the correct match regions overlap.

6.1  Background 

For this development we will consider point features, consisting
of a position and orientation but without spatial extent. We
will then extend the result to include the extended features used
in the implementation. Each match between a model and image
feature defines a region of transformation parameter space
we call a match region. Transformation sampling seeks to find a
region where many such match regions intersect, defining a
transformation which accounts for a large amount of the data
derived from the image. When dealing with point features, match
regions have circular cross sections in planes of constant $\phi$;
we call the projection of these circular cross sections of match
regions onto the $u$-$v$ plane match circles. In the $u$-$v$ plane,
match circles follow circular paths as $\phi$ varies, of radius
$|m_i|$ centered at $d_j$, given by $t_{ij} = d_j - M_\phi m_i$ for
feature match $\langle m_i, d_j \rangle$.

Consider the projection of a slice of parameter space at
constant $\phi = \phi_0$ onto the $u$-$v$ plane. Assume that
$\phi_0$ is the correct rotation we seek, aligning the orientations
of the correctly matched model and image features. Let $t_0$
represent the correct translation in the $u$-$v$ plane. Then for
all correctly matched model and image features, $|t_0 - t| \le R$
by definition of the position uncertainty. Graphically this means
that the correct match circles include the point $t_0$.
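
The circular path traced by the centre of a match circle can be checked directly from the transformation equation $d = M_\phi m + t$ of section 2.2. The short derivation below solves that equation for the translation and uses only the fact that a rotation preserves length; the subscripted notation $t_{ij}(\phi)$ is introduced here for illustration.

```latex
\begin{align*}
t_{ij}(\phi) &= d_j - M_\phi m_i, \\
\lvert t_{ij}(\phi) - d_j \rvert &= \lvert M_\phi m_i \rvert = \lvert m_i \rvert .
\end{align*}
```

So as $\phi$ varies, the centre stays at a fixed distance $|m_i|$ from $d_j$ in the $u$-$v$ plane, which is the circular path described in section 6.1.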

6.2  The  Overlap  of  Match  Regions 

For each image feature, assume there is a probability density
function characterizing the deviations of the measured feature
position from the actual feature position. From this we can derive
a probability density function $f_p(p)$ characterizing the
distribution of $p = |t_0 - t|$, the absolute distance in $u$-$v$ space
between the correct translation and the translation calculated
from the measured feature. In this analysis we will assume that
$f_p(p)$ is zero outside $0 \le p \le R$. It is important to bear in
mind that we are dealing with the case where the correct rotation,
$\phi_0$, has already been performed. Now consider a circle of
radius $r = R - p$ centered at $t_0$. See figure 4. This circle is
completely contained within the match circle. Call this circle $C_r$.
We can easily derive the probability distribution of the value $r$
from $f_p(p)$: $f_r(r) = f_p(R - r)$. Lastly consider the probability
$P_{r_0}$ that for a particular feature match, the radius of the little
circle, $C_r$, is less than some constant $r_0$:

$$P_{r_0} = \int_0^{r_0} f_r(r)\,dr$$

We will show that a large number of match circles overlap a circle
$C_{r_0}$ of some radius $r_0$, and thus we will find these by sampling
points which fall inside $C_{r_0}$.

We now calculate the probability that more than $n_0$ feature
matches each produce a circle $C_r$ with radius $r < r_0$. Assume
there are $N$ correctly matched image and model features. That
is, there are $N$ of the model features visible in the image.
Assuming the measurement errors are independent, the probability
that $n$ of these $N$ matches each produce a small circle $C_r$ with
$r < r_0$ is given by the binomial distribution:

$$P(n) = \binom{N}{n} P_{r_0}^{\,n} (1 - P_{r_0})^{N-n}$$

so the probability that more than $n_0$ feature matches each
produce circles $C_r$ with radius $r < r_0$ is then

$$P_{>n_0} = \sum_{n=n_0+1}^{N} \binom{N}{n} P_{r_0}^{\,n} (1 - P_{r_0})^{N-n}$$

6.3 Sampling Parameter Space

If $P_{>n_0}$ is small for small values of $n_0$, then it is likely that
most correct match circles mutually intersect over a region $C_r$ of
radius at least $r_0$. If we choose the translational sampling interval
$\delta t$ according to $r_0$, we can ensure that a sample point will
fall in the circle $C_r$. Again, we have assumed that rotation by the
correct amount, $\phi_0$, has been performed. We now consider the
structure of parameter space as $\phi$ moves away from the correct
value $\phi = \phi_0$. The velocity at which a match circle moves in the
$u$-$v$ plane as $\phi$ varies is proportional to $|m|$, the distance of
its center from the center of rotation.

Figure 4: A circle of radius $r$ centered at $t_0$ fits inside the match
circle.

Figure 5: As $\phi$ moves away from the correct value $\phi_0$, the circle
$C_r$ of radius $r_0$ centered at $t_0$ shrinks to radius $\alpha r_0$.

We can bound this velocity as follows: If the maximum dimension
of the image is $l$, then we can shift the model features
such that $|m| <$  for all model features. In this case the
maximum velocity with respect to $\phi$ of a match circle is
 rad$^{-1}$. Thus, the circle $C_r$ of radius $r_0$ which probably
intersects at least $N - n_0$ match regions can shrink in radius no
faster than the match circles forming its boundary can move.

6.3.1 Sampling Intervals

Say we are willing to let $C_r$ shrink to radius $\alpha r_0$ where
$0 < \alpha < 1$. In this case, as we vary $\phi$ from $\phi_0$, as long as

$C_r$ will have radius at least $\alpha r_0$. See figure 5.

This allows us to define the rotation sampling interval $\delta\phi$ and
the translation sampling interval $\delta t$ in terms of $r_0$ and
$\alpha$. To be sure we sample within a circle of radius $\alpha r_0$ we
must have
7 Discussion

There are some strong assumptions we are making in this analysis.
First, that the deviation of the measured position of an image
feature is randomly distributed, and that the magnitude of this
deviation is characterized by $f_p(p)$. Second, that these events
are independent of each other for different image features. It is
possible that the error in position measurement is systematic, or
that the error in measurement for one feature is correlated with
that of another. Note, however, that the dependence on the actual
form of the distribution $f_r(r)$ is minimized since we deal
with its integral $P_{r_0}$, thus we need only consider large classes of
functions $f_r(r)$.

Nonetheless, we can view this analysis as a counting argument.
The probabilistic analysis facilitates characterization of
the space of all possible error configurations, where a point in
this space is given by a set of position errors, one for each image
feature correctly matched. The probabilistic analysis determines
the fraction of all possible error configurations for which we can
be certain not to miss the event in parameter space we seek when
sampling it. Note that this analysis provides a very loose lower
bound on $\delta t$ and $\delta\phi$ since we assumed the worst possible
behavior of the circle $C_{r_0}$.

7.1  An  Example 

Assume that the measured position of an image feature is equally
likely anywhere within a distance $R$ of the true position. The
probability that the measured feature falls within a distance $p_0$
of the true position is

$$P(p \le p_0) = \frac{p_0^2}{R^2}$$

Let $n_0 = \delta N$, $r_0 = \beta R$ and $\alpha$ be as above, where
$0 \le \alpha, \beta, \delta \le 1$.

As a specific example, let $\delta = 1/2$, $\beta = .159$, and $N = 30$,
so $n_0 = \delta N = 15$. Let $R = 5$, so $r_0 = \beta R = .795$. Thus
$P_{r_0} = \beta(2 - \beta) = .293$. Finally, $P_{>n_0} = .005$.

Here we see that over 99% of all possible error configurations
result in a circle $C_r$ of radius $r_0 = .795$ which is contained in
more than $N/2$ match circles, at the correct rotation.
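
The arithmetic of this example can be checked with a short sketch. It assumes the uniform-disc error model stated above, so that the probability of a small circle is $\beta(2 - \beta)$, and it evaluates the binomial tail for $N = 30$ and $n_0 = 15$. The function names are illustrative.

```python
import math

def p_small_circle_uniform(beta):
    """P(r < beta*R) when the measured position is uniform over a disc of radius R.

    r < beta*R means p > (1 - beta)*R, whose probability is 1 - (1 - beta)**2.
    """
    return beta * (2.0 - beta)

def binomial_tail(N, n0, p):
    """P(more than n0 successes in N independent trials with success probability p)."""
    return sum(math.comb(N, n) * p**n * (1.0 - p)**(N - n)
               for n in range(n0 + 1, N + 1))

p_r0 = p_small_circle_uniform(0.159)   # probability that one match gives a small circle
tail = binomial_tail(30, 15, p_r0)     # probability that more than 15 of 30 do
```

Under these assumptions `p_r0` is close to the quoted .293 and `tail` is close to the quoted .005.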

Say the image size, $l$, is 256. Picking $\alpha = 2/3$ to minimize the
density of sample points in parameter space we have $\delta\phi = .084$
degrees, and $\delta t = .75$.
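
The displayed sampling-interval conditions do not survive in this copy. The sketch below assumes one reading that reproduces the numbers quoted here: a square translation grid of spacing $\delta t = \sqrt{2}\,\alpha r_0$ always places a sample inside a circle of radius $\alpha r_0$, and the rotation spacing is $\delta\phi = \sqrt{2}\,(1 - \alpha)\, r_0 / l$ radians. Both formulas, the value $\alpha = 2/3$, and the function name are assumptions rather than the paper's stated equations.

```python
import math

def sampling_intervals(r0, alpha, l):
    """Return (dt, dphi in degrees) under the ASSUMED conditions
    dt = sqrt(2) * alpha * r0 and dphi = sqrt(2) * (1 - alpha) * r0 / l (radians).
    """
    dt = math.sqrt(2.0) * alpha * r0
    dphi = math.sqrt(2.0) * (1.0 - alpha) * r0 / l
    return dt, math.degrees(dphi)

# Values from the uniform example: r0 = .795, image size l = 256.
dt, dphi_deg = sampling_intervals(r0=0.795, alpha=2.0 / 3.0, l=256)
```

With these assumed formulas the number of sample points per unit volume of parameter space is proportional to $1 / (\alpha^2 (1 - \alpha))$, which is smallest at $\alpha = 2/3$.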

These numbers result from the assumption that the measured
positions of image features are uniformly distributed, in which
case the probabilistic analysis is analogous to a counting argument.
Note that this is an extremely unfavorable distribution $f_p(p)$.
We would expect that the probability is high that $p$ is small.
Thus with a more reasonable distribution we expect larger sampling
intervals. As an example of a more likely distribution of the
position error, assume that $f_p(p)$ is a truncated half-gaussian
with standard deviation $\sigma$, with $f_p(p) = 0$ outside
$0 \le p \le R$ and

inside the interval. In this case $P_{>n_0} = .013$, $\delta t =$ ,
$\delta\phi = .28$ degrees, where we used $\beta = .535$, $\alpha =$ ,
$l = 256$.
7.2 Extended Features

The argument for point features also works for extended features
such as line segments. For a particular feature match at a particular
rotation there is a region of possible translations. The
length of this region is a function of the position uncertainty and
the difference in the lengths of the model and image features.
An equivalent situation is given when we subtract the length of
the image feature from both the model and the image features.
Assuming the image feature is less than or equal in length to the
model feature, we then have an image feature which is a point
paired with a model feature which is either a point or a line. It
is easy to see that in this case the amount of overlap of match
regions is equal to or greater than the case where only point
features are involved. Thus, the point feature analysis provides
a lower bound on the performance of the technique when using
extended features.


8 Implementation

In this section we discuss the implementation of this recognition
procedure for a particular model of parallel computation. Because
of the local nature of many of the computations, the simplicity
of the few global computations required, and the modest
processor requirements, this recognition procedure is efficiently
implemented on a parallel architecture.

8.1 Background

The model of parallel computation we use is the Exclusive
Read Exclusive Write (EREW) Parallel Random Access Machine
(PRAM) model [15]. We restrict the PRAM to Single Instruction
Multiple Data (SIMD) since we do not require the more general
Multiple Instruction Multiple Data (MIMD) model.

Before we discuss the specific aspects of the algorithm we need
to be familiar with some basic parallel operations. Two of what
we might call primitive parallel operations are scan operations
and permutation (parallel exclusive read and write). There are
a number of simple operations, constructed with primitive
operations, that are useful, such as sorting. Scan operations, sorting,
and parallel reads and writes will play a key part in the
implementation of this recognition technique.

Conceptually we might view the parallel machine as a linear
vector of processors and memory [16]. Permutation of data elements
in this vector is equivalent to exclusive read and write operations.
In this model these operations take $O(1)$ time. Probably less
familiar to the reader are the scan operations, which are based
on the application of binary associative operators over a set of
values. Strictly speaking, in the EREW model the scan operations
can be done in $O(\lg M)$ time on $M$ processors, although
the scan operation could be considered a primitive operation
requiring $O(1)$ time [16]. As an example, the scan operation can be
used to distribute a value from one processor to several others,
and to compute the cumulative sum, maximum, and minimum
of values represented in a set of processors. The scan is used
extensively in the implementation of this recognition procedure.
Finally, sorting of values stored in a set of processors can be
accomplished in $O(\lg^2 M)$ time for $M$ values. These operations
are the basic tools with which we build the parallel recognition
procedure.


8.2 The Parallel Nature of This Formulation

All of the operations required to implement this procedure are
fast in parallel. Furthermore, the number of processors required
is linear in $mn$, the number of feature matches. The fundamental
idea of the parallel implementation is that each processor
manipulates one data structure. For example, throughout most of the
computations, each processor has a representation of one model
feature, and one data feature, which form a particular match.
This processor will perform operations on these features such as
finding the difference in their lengths, rotating and translating
the model feature, etc. These are all simple $O(1)$ time operations,
and are performed on all processors at once.

In section 4.1 we presented an outline of the algorithm. In this
section we will consider the implementation of these steps on the
EREW model.

• Form all matches $\langle m_i, d_j \rangle$ of model and image features

Starting with each model feature represented one per processor,
and each image feature one per processor, forming all pairs of
model and image features, $\{m_i\} \times \{d_j\}$, requires $O(\lg mn)$
time, and is accomplished using write and scan operations by
replicating each image feature $m$ times, and each model feature $n$
times and reordering the features so all matches are formed, one per
processor.
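
The replicate-and-reorder step can be illustrated sequentially. The sketch below is not the parallel write-and-scan implementation; it only shows the layout that results when each model feature is repeated $n$ times and each image feature $m$ times so that every slot holds one match. The function name and the string labels are hypothetical.

```python
def form_all_matches(model_features, image_features):
    """Pair every model feature with every image feature, one pair per slot.

    Each model feature is repeated n times and each image feature m times,
    and the two replicated lists are then aligned slot by slot.
    """
    m, n = len(model_features), len(image_features)
    # Layout: m0 m0 ... m0 m1 m1 ... (each model feature n times in a row).
    replicated_model = [f for f in model_features for _ in range(n)]
    # Layout: d0 d1 ... d0 d1 ... (the whole image list repeated m times).
    replicated_image = [f for _ in range(m) for f in image_features]
    return list(zip(replicated_model, replicated_image))

matches = form_all_matches(["m0", "m1"], ["d0", "d1", "d2"])
```

Each entry of `matches` stands for the contents of one processor after the step.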

•  Compute  the  range  of  valid  relative  rotations  for  each  match 

•  Determine  the  set  of  rotation  sample  points  in  this  range  of  ro¬ 
tations 

•  For  each  rotation  sample  point,  compute  the  range  of  valid  rel¬ 
ative  translations 

• For each rotation sample point, determine the set of translation
sample points in this range of translations

This is done by spreading the work over several processors. First,
for each feature match, the set of valid rotation sample points is
calculated, and the match is replicated in a new set of processors,
one copy per rotation sample point, forming a set of rotated
matches. Then, each new processor representing a rotated match
actually rotates the model feature according to the rotation sample
point, and calculates the set of translation sample points for
its particular rotation. The rotated matches are then replicated,
one copy per translation sample point per processor.

The result is that each original match $\langle m_i, d_j \rangle$ is
represented in $T_{ij}$ separate processors, one per transformation
sample point, where $T_{ij}$ is the number of transformation sample
points falling in its match region. The process of calculating these
sets requires $O(1)$ time, and distributing them one per processor
requires $O(\lg(Tmn))$ time.
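
A sequential sketch of the replication step for a single parameter follows. It assumes sample points lie on a regular grid at integer multiples of the sampling interval, which the text does not state explicitly; the helper names and the example range are hypothetical.

```python
import math

def sample_points_in_range(lo, hi, step):
    """Grid points k*step (k an integer) inside the closed range [lo, hi]."""
    first = math.ceil(lo / step)
    last = math.floor(hi / step)
    return [k * step for k in range(first, last + 1)]

def replicate_match(match, lo, hi, step):
    """One (match, sample point) copy per sample point, as if one per processor."""
    return [(match, s) for s in sample_points_in_range(lo, hi, step)]

# A hypothetical match whose valid range is [-5, 7] with a sampling interval of 2.
copies = replicate_match(("m0", "d3"), lo=-5.0, hi=7.0, step=2.0)
```

Applying the same replication first over rotation and then over translation gives the one-copy-per-transformation-sample-point layout described above.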

• For each complete parameter space sample point thus constructed,
compute the value of $F(\phi, u, v)$ at the point

Calculating $f_{ij}(\phi, u, v)$ for each match takes $O(1)$ time. To
calculate $F(\phi, u, v)$ in the case where this consists of the sum of
the $f_{ij}(\phi, u, v)$ requires first sorting the matches according to
their associated transformation sample points, and arranging them in
contiguous groups of processors with the same transformation
sample point. $F(\phi, u, v)$, for each transformation sample point, is
computed by performing a plus scan of the $f_{ij}(\phi, u, v)$ for
matches instantiated at the same transformation sample point. The sort
and scan require, respectively, $O(\lg^2(Tmn))$ and $O(\lg(Tmn))$
time.
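
To make the logarithmic step count of a plus scan concrete, here is a sequential simulation of the well-known doubling scheme for an inclusive prefix sum. It is offered only as an illustration of why $O(\lg M)$ steps suffice; the paper does not say which scan algorithm its implementation uses, and the function name is hypothetical.

```python
def plus_scan(values):
    """Inclusive prefix sum computed in ceil(lg M) data-parallel steps.

    Each pass of the while loop stands for one parallel step in which
    processor i adds the value held by processor i - offset.
    Returns the scanned list and the number of steps taken.
    """
    result = list(values)
    offset = 1
    steps = 0
    while offset < len(result):
        previous = list(result)  # every processor reads before any writes
        for i in range(offset, len(result)):
            result[i] = previous[i] + previous[i - offset]
        offset *= 2
        steps += 1
    return result, steps

sums, steps = plus_scan([3, 1, 4, 1, 5, 9, 2, 6])
```

For eight values the loop runs three times, matching $\lg 8 = 3$.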

• Select the transformation sample points with optimal values of
$F(\phi, u, v)$

This final step is done by performing a simple global maximum
operation on the values of $F(\phi, u, v)$ for each transformation
sample point $(\phi, u, v)$. The optimal value of $F(\phi, u, v)$ can be
found in parallel in $O(\lg(Tmn))$ time, giving the best transformation.

It is interesting to note that the final operations are similar to
a histogramming procedure where rather than explicitly
representing each bin, we sort the items to be histogrammed according
to the bin they fall into, and use a scan to determine the number
that fell in each occupied bin. Only the occupied bins are
represented, so the number of possible bins can be very large, as long
as only a reasonable number are ever occupied.
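
The sort-then-sum histogram can be sketched sequentially as follows. Sorting makes equal sample points contiguous, a per-run sum plays the role of the plus scan, and a final maximum picks the best transformation. The data layout (a list of sample point and weight pairs) and the function name are assumptions made for the illustration.

```python
from itertools import groupby

def best_transformation(votes):
    """votes: list of (sample_point, weight) pairs, one per instantiated match.

    Sort by sample point so equal points become contiguous, sum each run,
    then take the global maximum. Only occupied bins are ever represented.
    """
    ordered = sorted(votes, key=lambda v: v[0])
    totals = [(point, sum(w for _, w in run))
              for point, run in groupby(ordered, key=lambda v: v[0])]
    return max(totals, key=lambda t: t[1]), totals

# Hypothetical (phi, u, v) sample points with feature-length weights.
votes = [((0, 2, 4), 3.0), ((2, 0, 0), 1.5), ((0, 2, 4), 2.0),
         ((2, 0, 0), 1.0), ((4, 6, 2), 4.0)]
best, totals = best_transformation(votes)
```

Here three bins are occupied out of an arbitrarily large space of possible sample points, and the winner is the point with the largest summed weight.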

Thus we see how the recognition procedure is implemented.
The procedure requires $O(\lg^2(Tmn))$ time and $O(Tmn)$ processors.


8.2.1  Pre-Pruning  of  Parameter  Space 

Although this method requires $O(Tmn)$ processors, in practice
$T$ can be large. For example, in figure 1 $T = 78$ while $m = 42$
and $n = 269$. Thus in practice, we might have $10^6$ transformation
sample points to consider, with only 64K processors to
use. One approach is to simply iterate over sets of matches and
transformation sample points, using all available processors at
each iteration.
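
A quick calculation shows the scale involved, taking $T \cdot m \cdot n$ as the number of sample points for the figure 1 values and assuming that 64K means 65,536 processors with one sample point handled per processor per pass. The function name is illustrative.

```python
import math

def passes_needed(T, m, n, processors):
    """Number of passes if each pass handles at most `processors` sample points."""
    return math.ceil(T * m * n / processors)

total = 78 * 42 * 269                      # T * m * n for the figure 1 example
passes = passes_needed(78, 42, 269, 64 * 1024)
```

Under these assumptions `total` is a little under $10^6$ and about fourteen passes would be needed without any pruning.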

We  employ  two  absolute  thresholds  to  eliminate  unlikely  trans¬ 
formations:  a  minimum  on  the  fraction  of  the  model  perimeter 
explained  by  the  data,  and  a  minimum  on  the  number  of  fea¬ 
ture  matches  valid  at  a  transformation  sample  point.  The  idea  is 
that  there  are  many  regions  of  transformation  parameter  space 
in  which  only  a  few  matches  fall. 

To implement this, a coarse translational Hough transform
is computed for each match, at each of its associated rotation
sample points. This is much faster than full translation sampling
and allows us to eliminate unpopular transformations before
we consider them further. For example in figure 1, there
were over 12,000 rotated matches falling alone in large regions of
translation space. Just eliminating these allows us to disregard
$w_l \times 12{,}000 \approx 150{,}000$ translation sample points, where
$w_l$ is the number of translation sample points for a given rotation
sample point. We can be sure that under the criterion that a minimum
amount of the model must be visible we are not eliminating
any correct hypotheses.
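
The pruning idea can be sketched with a coarse vote count. This is a simplified stand-in for the coarse Hough transform described above: it assumes each rotated match votes in a single coarse translation cell given by a representative translation, whereas a fuller version would vote in every cell the match region overlaps. The names, the cell width, and the threshold are hypothetical.

```python
from collections import Counter

def prune_sparse_matches(rotated_matches, cell, min_count):
    """Keep rotated matches whose coarse translation cell holds >= min_count votes.

    rotated_matches: list of (match_id, rotation_index, (u, v)), where (u, v)
    is a representative translation; `cell` is the coarse bin width.
    """
    def key(item):
        _, rot, (u, v) = item
        return (rot, int(u // cell), int(v // cell))

    counts = Counter(key(item) for item in rotated_matches)
    return [item for item in rotated_matches if counts[key(item)] >= min_count]

candidates = [("a", 0, (3.0, 4.0)), ("b", 0, (5.0, 6.0)),
              ("c", 0, (90.0, 10.0)), ("d", 1, (3.0, 4.0))]
kept = prune_sparse_matches(candidates, cell=16.0, min_count=2)
```

Matches "c" and "d" fall alone in their cells and are dropped before any fine translation sampling is done for them.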


9  Experiments 

9.1  Connection  Machine  Implementation 

The  system  implemented  is  a  simple  demonstration  of  this 
method.  For  simplicity  in  the  initial  implementation,  only  a 
single  instance  of  a  single  model  is  present  in  the  image.  In  the 
next  section  we  will  show  why  the  algorithm  works  with  multiple 
instances  of  a  single  model  as  well  as  a  small  number  of  different 
models  with  multiple  instances. 

We define $f_{ij}(\phi, u, v) = \mathrm{length}(d_j)$ inside the match
region and 0 outside. The best transformation is chosen as that with the
maximum value of $F(\phi, u, v) = \sum_{ij} f_{ij}(\phi, u, v)$ without
further reasoning. We would not build a complete recognition algorithm
in this way, but this serves to demonstrate the idea.

The Connection Machine [17], or CM-1, is a parallel computer
based on a hypercube network, which simulates reads and writes
on an $M$ processor EREW machine in $O(\lg M)$ time with high
probability. Each processor has approximately 4K bits of local
memory, and is capable of general computation.

Figure 6: $n = 41$, $m = 241$, $\delta\phi = 2$ deg, $\delta t = 2$ pixels,
$\Theta = 6$ deg, $R = 5$ pixels, $T = 144$

The interface to the CM-1 is through a Symbolics 3600 series
computer, to which special imaging hardware is attached
facilitating the loading of image data into the CM-1. The entire
recognition procedure consists of loading the data image into the
CM-1, extracting intensity edges, forming a polygonal approximation
of these edges to form image features, and then matching
these against previously stored model features.

A parallel implementation of the edge detector developed by
Canny is used to extract intensity edges. The polygonal
approximation technique employed also exploits parallelism, but
is an extremely simple technique consisting of recursive splitting
of straight line approximations to the edge curves until some
error bound is achieved.

The system was implemented on a quarter-sized CM-1, consisting
of 16K processors. The full machine has 64K processors;
however, the quarter machine can simulate the full machine with
very close to a factor of 4 slowdown. Thus we assume that we
have implemented this procedure on a 64K processor machine.

9.2  Results 

The initial implementation has been run on very heavily occluded
input scenes. Examples of typical images are shown in figure 6,
figure 7, and figure 8. The values of the parameters used for
these examples are shown. The implementation has not been
optimized for speed, and there is room for speedup. However,
the transformation sampling portion of the algorithm runs
approximately 5 seconds on a 16K processor machine, or
approximately 1 second on a full 64K processor machine. The Canny
edge detector runs in approximately .3 seconds, and the polygonal
approximation procedure runs in about 2 seconds on a 16K
machine.

Figure 7: $n = 31$, $m = 259$, $\delta\phi = 2$ deg, $\delta t = 2$ pixels,
$\Theta = 6$ deg, $R = 5$ pixels, $T = 121$

Figure 8: $n = 10$, $m = 225$, $\delta\phi = 2$ deg, $\delta t = 2$ pixels,
$\Theta = 6$ deg, $R = 5$ pixels, $T = 116$


10  Extensions 

10.1  Multiple  Models,  Multiple  Instances 

The algorithm and the resulting recognition system are more
general than we have initially described. The algorithm can
recognize multiple instances of an object in the image. To see this,
note that after the transformation sampling procedure we have
available hundreds of different hypothesized transformations, all
of which are globally consistent, valid interpretations of the
image data. To find multiple instances of a model, all the work is
already done and all we need to do is pick those other
transformations which meet the minimal criteria.

Since the model is just a collection of features, we could
simultaneously recognize 2 or more objects by simply including the
new model features with the original set. All that is necessary is
that we keep the calculations for each different model separate
from one another. This is trivial to do in this implementation.


10.2 Refining the Transformation

For demonstration purposes, in this implementation we have
relied on one stage of transformation sampling for the entire
recognition engine. This provides a good approximation to the best
possible transformation. A two stage procedure could be used
to refine these transformations. In the first stage we sample
space as coarsely as possible provided that we can be sure to
find approximately the correct feature mapping and associated
transformation. We can then refine a small number of hypothesized
transformations further. The advantage of this multi-stage
approach is reduction in computational complexity; the number
of sample points that need be initially considered is smaller.

The refinement stage could be a second iteration of the
transformation sampling technique concentrating only on a finer grid
of sample points within one interval of the original sampling grid,
and near a hypothesized transformation from the first stage.
Another possibility is a parallel gradient based search of the
transformation parameter space, utilizing appropriate definitions of
the functions $F(\phi, u, v)$ and $f_{ij}(\phi, u, v)$ which would
produce a smooth and continuous search space.
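
The first refinement option can be sketched as a finer grid laid over one coarse interval around a first-stage hypothesis. The sketch assumes the fine grid is centered on the hypothesis and extends half a coarse interval to each side; the function names, the hypothesis, and the step sizes are hypothetical.

```python
def refinement_grid(center, coarse_step, fine_step):
    """Fine sample points within half a coarse interval on either side of `center`."""
    half = coarse_step / 2.0
    count = int(half / fine_step)
    return [center + k * fine_step for k in range(-count, count + 1)]

def refine(hypothesis, coarse_steps, fine_steps):
    """All (phi, u, v) fine samples around one coarse hypothesis."""
    axes = [refinement_grid(c, cs, fs)
            for c, cs, fs in zip(hypothesis, coarse_steps, fine_steps)]
    return [(phi, u, v) for phi in axes[0] for u in axes[1] for v in axes[2]]

# Coarse grid of 2 deg and 2 pixels refined to 0.5 deg and 0.5 pixels.
fine = refine((10.0, 40.0, -6.0),
              coarse_steps=(2.0, 2.0, 2.0),
              fine_steps=(0.5, 0.5, 0.5))
```

Only the few hypotheses that survive the first stage need this treatment, so the number of fine sample points stays small compared with sampling the whole space finely.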

10.3  Model  Libraries  and  3D  Recognition 

This robust recognition technique was not intended to handle
large libraries, but rather to be able to recognize a small number
of possible objects. The system would, however, provide a second
stage for robust verification, following an indexing first stage
which selects a small set of candidate models from a much larger
library of models.

Figure 9: The models.

Lastly  we  consider  extensions  to  3D  recognition  from  3D  range 
data.  We  cannot  directly  apply  this  procedure  to  3D  feature 
matches  since  there  is  no  unique  transformation  aligning  two 
lines  in  3D.  We  can,  however,  introduce  constraint  by  considering 
pairs  of  model  features  with  pairs  of  image  features,  and  then  use 
transformation  sampling  as  before,  except  in  a  6D  parameter 
space.  The  complexity  of  the  procedure  rises  dramatically  in 
this  case  due  to  the  additional  degrees  of  freedom,  however  in 
principle  the  technique  will  work. 

11  Conclusion 

We have introduced transformation sampling as a technique for
2D model-based recognition, and given a formal basis for the
technique and the determination of the appropriate sampling
intervals. The technique offers several advantages over existing
techniques based on the Hough transform.

One of the main goals was to produce a recognition algorithm
suitable for parallel implementation. The algorithm developed
not only provides an effective recognition technique, but is highly
parallel in nature.

We have described and demonstrated an implementation of
a recognition system for large SIMD parallel computers. The
processor requirements are $O(Tmn)$, linear in the number of
pairs of model and data features, and the time requirements
$O(\lg^2(Tmn))$. The system is capable of determining a very close
estimate of the transformation which aligns a model with its
instance in an image.
Abstract

The  power  of  object  recognition  features  as  a  function  of  view¬ 
point  is  explored.  A  prior  system  for  model-based  recognition 
is  examined,  using  the  geometries  of  the  relevant  match  feature 
to define a viewpoint based error function. General rules for
determining degenerate viewpoints of a feature are proposed. The
use of the function in selecting promising match features is
discussed. Finally, results of the recognition system are presented
along  with  an  analysis  of  the  relevant  error  values,  demonstrat¬ 
ing  the  ability  of  the  error  metric  to  predict  the  reliability  of 
features  at  arbitrary  viewpoints  and  to  derive  a  minimal  set  of 
effective  features. 

1  Introduction 

1.1  Previous  Work 

The area of model-based vision is receiving increased acceptance
as a practical approach to the recognition of objects in
unconstrained intensity images [Lowe], [Ayache and Faugeras],
[Huttenlocher and Ullman], [Grimson and Lozano-Perez],
[Ikeuchi], [Thompson and Mundy]. The central mechanism is
the use of a three dimensional geometric model to constrain
the analysis of a two dimensional intensity image projection.
The recognition of an object involves the determination of an
appropriate match between the model and its projection. The
consistency of the projection transformation is the main criterion
for the validity of feature assignments between the object
model and the segmentation of the intensity projection.

The image data is segmented into boundary curves and
regions which are then represented as geometric entities. A
grouping step is required to form a sufficient number of
constraints to determine the transformation between the object
model and the image. For example, Huttenlocher uses three
image point correspondences to establish the affine transformation
between the model and the image coordinate frames. Lowe
uses grouping of parallel lines and significant edge junctions to
index feasible model assignments. Grimson and Lozano-Perez
as well as Ayache and Faugeras use a cost-directed sequential
search to establish consistent model assignments for edge and
vertex features. Ikeuchi uses characteristic views [Chakravarty]
to index into a decision tree that is automatically generated
from a solid object model. The characteristic view features are
defined in terms of unique region shapes and junction
configurations.

*This work was supported in part by the DARPA Strategic Computing
Vision Program in conjunction with the Army Engineer Topographic
Laboratories under Contract No. DACA76-86-C-0007.

An  overriding  theme  of  all  of  this  work  is  that  parameters 
of  geometric  features  which  are  extracted  from  the  image  data 
support  the  computation  of  the  transformation  between  the  im¬ 
age  and  model  coordinate  frames.  It  thus  becomes  an  impor¬ 
tant  issue  to  examine  the  errors  associated  with  this  computa¬ 
tion  and  the  resulting  effect  on  the  validity  of  feature  matches 
and  consequent  object  recognition  performance.  Ikeuchi  and 
Kanade  have  already  begun  to  explore  the  effectiveness  of  var¬ 
ious  types  of  sensors  in  terms  of  viewing  geometry  and  feature 
strength  [Ikeuchi  and  Kanade]. 

We will pursue this question in the context of a particular
model matching approach which we have been developing over
the past several years. Our approach is based on the observation
that the affine projection transformation can be determined
from just two vertices and the edge directions associated
with one of the vertices [Thompson and Mundy].

1.2  The  Vertex-Pair  Feature 

The  vertex-pair  geometry  is  shown  in  Figure  1.  The  line  seg¬ 
ment  connecting  the  two  vertices  is  called  the  spine.  This  seg¬ 
ment  merely  indicates  the  grouping  of  the  two  vertices  and  does 
not  actually  have  to  exist  in  the  segmentation  or  the  model.  We 
choose  one  of  the  vertices  to  be  the  base  vertex  and  define  the 
angles  between  two  edges  intersecting  at  the  vertex  and  the 
spine  as  a  and  0. 

The main advantage of this image feature is that only two
image vertices need to be grouped in order to establish the full
six parameter transformation. The edges associated with one
of the vertices are naturally available since image vertices are
defined in terms of the intersection of boundary edge segments.
The pair-wise vertex grouping leads to an algorithm which has
complexity proportional to $M(N - 1)^2$ where $N$ is the number
of image segmentation vertices, and $M$ is the number of
vertex-pairs defined in the model.
Figure 1: The three dimensional vertex-pair and its affine
projection.

More complex features become exceedingly expensive; the
cost increases exponentially with the number of primitive features
in the group. We also observe that the fragility of current
segmentation algorithms favors approaches that do not require
the reliable extraction of complex features such as trihedral
junctions or well formed object face projections.

In our approach, all of the possible assignments are made
between the object model vertex-pairs and the feasible vertex-pairs
in the image segmentation. This exhaustive approach is
based on the view that there is no reliable way of throwing
out assignments outside of the context of the transform coherence
that is generated by the matching process itself. However,
we do eliminate vertices that do not have incident edges that
are long enough to establish accurate projection parameters,
in order to reduce the computational complexity as much as
possible. In current experiments, we typically consider edges
shorter than 5-10 pixels to be too short to provide an accurate
transformation.

Each vertex-pair assignment leads to a transformation
between the model and its proposed image projection. In our
approach, these transformations are clustered in the transform
parameter space to determine valid assignments. This validity
is based on the concept of viewpoint consistency [Lowe], which
is the observation that all points of an object will project into an
image with the same transformation mapping function. Thus,
if one has a correct set of assignments, then the
transformations computed from those assignments should all agree. The
clustering concept is illustrated schematically in Figure 2.
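A minimal sketch of the clustering idea follows. It assumes each transformation is a tuple of parameters already scaled to comparable units and uses a simple fixed-radius neighborhood count; the function name and the radius test are illustrative assumptions, not the clustering procedure used in the system.

```python
import math

def largest_cluster(transforms, radius):
    """Return the largest group of transforms lying within `radius`
    (Euclidean distance in parameter space) of one of the transforms.

    transforms: list of equal-length tuples of floats.
    """
    best = []
    for center in transforms:
        members = [t for t in transforms if math.dist(center, t) <= radius]
        if len(members) > len(best):
            best = members
    return best
```

Assignments whose transforms fall in the winning cluster agree on a single viewpoint; isolated transforms are treated as incorrect assignments. The scan is quadratic in the number of transforms, which is acceptable for a sketch but not for large assignment sets.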

We have carried out a number of experiments in using the
vertex-pair matching system to recognize objects in outdoor
scenes, including some experiments with infra-red data from
the April 1986 U.S. air raid on Libya which we discuss later.
In general, the results have been quite encouraging in terms of
the ability of the algorithm to deal with clutter and occlusion.

An example of this matching process is given for the image
of the GE Research and Development Center in Figure 3. The
segmentation for this image is shown in Figure 4. The model
for a wing of the R&D Center is shown in Figure 5. The match
between this model and the image data is shown in Figure 6.

These experiments have indicated the need for further
developments to account for the errors associated with the
placement of features in the image segmentation and the effect of
these errors on clustering. We now consider these issues in
more detail.

Figure 2: The clustering of transform values to establish valid
model feature assignments. The figure shows the distribution
of transform values in a three dimensional subset of the six
dimensional affine transform space.

Figure 3: A view of the GE Research and Development Center.

Figure 4: The segmentation into edges and vertices.

Figure 5: A model of the west wing.

1.3  Weaknesses  of  the  Current  Approach 

It  is  clear  that  the  success  of  the  clustering  process  is  crucially 
dependent  on  the  accuracy  with  which  the  transformation  can 
be  determined  from  each  vertex-pair  assignment;  we  depend  on 
having a small cluster radius threshold in transform space to
filter  out  incorrect  matches. 

In  our  algorithm  we  depend  on  accurate  locations  for  the 
vertices  and  accurate  angles  between  the  direction  of  the  spine 
of  the  vertex-pair  and  the  other  two  edge  directions.  For  some 
viewpoints,  the  uncertainty  in  these  feature  parameters  can  be 
amplified  in  the  transformation  parameters  which  give  the  three 
dimensional  location  and  orientation  of  the  object  model  with 
respect  to  the  image  coordinate  frame. 


Figure  6:  The  resulting  match  of  the  model  determined  by  the 
vertex-pair  algorithm. 

The  current  implementation  attempts  to  provide  for  this 
uncertainty  in  the  transform  values  by  including  all  transform 
solutions  which  would  correspond  to  an  assumed  spread  in  the 
image  feature  parameters.  For  example,  we  assume  that  the 
projected angles, $\alpha$ and $\beta$, can be in error by 5° due to
uncertainty in the image segmentation process. This propagated
uncertainty  in  the  model  transformation  can  lead  to  a  range  of 
solutions  for  the  model  match.  Each  solution  corresponds  to  a 
slightly  different  value  for  the  rotation  parameters  of  the  model 
transformation. 

The  relationship  between  image  segmentation  errors  and 
the  model  transformation  uncertainty  will,  of  course,  depend  on 
the  viewpoint.  It  is  also  the  case  that  degeneracies  will  exist  in 
the viewing projection. In particular, the transformation equations
become  degenerate  for  viewpoints  which  are  collinear  with  the 
edges  and  spine  of  the  vertex-pair. 

In  the  current  implementation,  the  model  vertex-pairs  are 
manually  selected  with  an  interactive  graphics  editor  which  op¬ 
erates  on  the  object  model.  There  is  no  guarantee  that  the 
vertex-pairs  selected  in  this  manner  will  have  a  good  error  per¬ 
formance.  That  is,  it  is  essential  that  a  set  of  vertex-pairs  be 
selected  which  provide  as  accurate  an  estimate  as  possible  of 
the  model  transform  parameters,  over  all  viewpoints.  It  is  thus 
desirable  that  we  extend  the  implementation  as  follows: 

•  Establish an error metric for the model transformation
which  can  serve  as  a  cost  function  in  optimizing  the  se¬ 
lection  of  model  vertex-pair  features. 

•  Clarify  the  nature  of  viewing  degeneracies  for  the  vertex- 
pair  feature,  so  that  these  viewpoints  can  be  avoided  in 
the  clustering  process. 

•  Implement  an  optimization  algorithm  to  automatically 
select vertex-pairs.

The remainder of the paper is devoted to a discussion of
these  issues  and  will  include  some  experimental  results  obtained 
for  an  error  metric  which  appears  to  satisfy  the  criteria  we  just 
stated. 
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Figure  7:  The  tip-tilt  transformation  of  the  vertex-pair  with 
respect  to  the  x-y  plane. 

2  The  Affine  Image  Transformation 

2.1  Computing  the  Transformation 

In a previous reference [Thompson and Mundy], we
demonstrated that the affine transformation is an effective
approximation to the general perspective case. In summary, the affine
transformation is characterized by an orthographic projection
with a scale factor. Thus the full effects of rotation and viewing
distance are represented in the affine transform. A consequence
of the affine approximation is that parallel lines are projected
as parallel lines. Thus, the usual perspective phenomenon of
the vanishing point for parallel lines is not predicted by the
affine approximation. However, for objects which are compact
compared to the viewing distance, this perspective distortion is
not large.

To  clarify  the  following  discussion,  we  briefly  review  the 
computation  of  the  affine  transformation  parameters  from  the 
vertex-pair  and  its  projection  in  the  image  viewplane.  First,  the 
transformation  is  defined  in  terms  of  rotation  and  translation 
parameters.  The  three  dimensional  rotation  is  defined  in  terms 
of successive rotations about the x, y, z coordinate axes. We
refer to these angles as $\phi$, $\psi$, $\zeta$. The translation parameters are
x, y and s, where we take the x-y plane to correspond to the
image  plane.  The  scale  factor,  s,  is  the  ratio  of  the  size  of  the 
object  in  the  image  to  the  actual  projection  of  the  object  model 
onto  the  x-y  plane.  This  scale  factor  is  inversely  proportional 
to  the  distance  from  the  object  to  the  camera. 

The most important step in determining the transformation
is determining the tip and tilt angles of the model with respect
to the image plane, $\phi$ and $\psi$. We have shown that these angles
can be determined directly from the projected angles between
the vertex edges and the spine, $\alpha$ and $\beta$. These angles are
defined in Figure 1.

Figure 8: The determination of $\zeta$ and x-y translation.

Figure 9: The determination of the projection scale factor.

In our current implementation, we precompute the functional
relationships, $\phi(\alpha,\beta)$ and $\psi(\alpha,\beta)$, and store them as
a look-up table computed for each vertex-pair defined in the
model. The values of $(\alpha,\beta)$ are computed when an image
vertex-pair is associated with a model vertex-pair and then
the table is used to retrieve $(\phi,\psi)$. The model vertex-pair is
then transformed by the $(\phi,\psi)$ rotations about the x and y
axes to establish its projection into the x-y plane. This step is
illustrated in Figure 7.
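One way such a look-up table could be built is sketched below. Everything here is an assumption made for illustration: the rotation order and sign conventions, the signed-angle definition, the 5° grid and bin size, and all function names. Self occlusion is not modeled.

```python
import math
from collections import defaultdict

def rotate_and_project(v, phi, psi):
    """Rotate v about x by phi, then about y by psi, and drop z
    (orthographic projection onto the x-y plane)."""
    x, y, z = v
    y, z = y * math.cos(phi) - z * math.sin(phi), y * math.sin(phi) + z * math.cos(phi)
    x, z = x * math.cos(psi) + z * math.sin(psi), -x * math.sin(psi) + z * math.cos(psi)
    return x, y

def signed_angle(a, b):
    """Signed angle from 2D vector a to 2D vector b."""
    return math.atan2(a[0] * b[1] - a[1] * b[0], a[0] * b[0] + a[1] * b[1])

def projected_angles(spine, edge1, edge2, phi, psi):
    """Projected angles (alpha, beta) between the spine and each edge."""
    s = rotate_and_project(spine, phi, psi)
    return (signed_angle(s, rotate_and_project(edge1, phi, psi)),
            signed_angle(s, rotate_and_project(edge2, phi, psi)))

def bin_key(alpha, beta, bin_deg):
    """Quantize (alpha, beta) into table bins of bin_deg degrees."""
    return (math.floor(math.degrees(alpha) / bin_deg),
            math.floor(math.degrees(beta) / bin_deg))

def build_table(spine, edge1, edge2, step_deg=5, bin_deg=5.0):
    """Map each (alpha, beta) bin to every sampled (phi, psi) that produces it."""
    table = defaultdict(list)
    for phi_deg in range(-85, 90, step_deg):
        for psi_deg in range(-85, 90, step_deg):
            alpha, beta = projected_angles(
                spine, edge1, edge2, math.radians(phi_deg), math.radians(psi_deg))
            table[bin_key(alpha, beta, bin_deg)].append((phi_deg, psi_deg))
    return table

def lookup(table, alpha, beta, bin_deg=5.0):
    """Return all tabulated (phi, psi) candidates, in degrees, for a measurement."""
    return table.get(bin_key(alpha, beta, bin_deg), [])
```

A bin may hold several (phi, psi) entries, which mirrors the fact that a single measured pair of projected angles can be consistent with more than one rotation.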

The remaining rotation angle, $\zeta$, can be determined by
observing the angle between the projected model spine and the
actual spine direction in the image segmentation. Since the
tip and tilt angles have been removed, this angle corresponds
directly to $\zeta$. This relationship is shown in Figure 8.

The translations in the x-y plane are also determined as
shown in Figure 8. These translation components are
simply computed from the translation vector between the
projected model base vertex and the actual base vertex in the
segmentation.

The  final  translation  parameter  is  given  by  the  ratio  of  the 
lengths  of  the  projected  model  spine  and  the  actual  length  of 
the  spine  in  the  image.  This  relationship  is  illustrated  in  Fig¬ 
ure  9. 
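The three in-plane steps just described (rotation about the view axis, x-y translation, and scale) can be sketched as follows. The function name, the argument layout, and the choice to measure the translation directly between the two base vertices before any scaling are assumptions made for this example.

```python
import math

def in_plane_parameters(model_base, model_tip, image_base, image_tip):
    """Estimate (zeta, translation, scale) from one spine correspondence.

    model_base, model_tip: 2D endpoints of the model spine after the
    tip/tilt rotation and projection into the x-y plane.
    image_base, image_tip: 2D endpoints of the spine in the segmentation.
    """
    mdx, mdy = model_tip[0] - model_base[0], model_tip[1] - model_base[1]
    idx, idy = image_tip[0] - image_base[0], image_tip[1] - image_base[1]

    # Angle between the projected model spine and the image spine,
    # wrapped into (-pi, pi].
    zeta = math.atan2(idy, idx) - math.atan2(mdy, mdx)
    zeta = math.atan2(math.sin(zeta), math.cos(zeta))

    # Scale: image spine length over projected model spine length.
    scale = math.hypot(idx, idy) / math.hypot(mdx, mdy)

    # Translation vector from projected model base vertex to image base vertex.
    translation = (image_base[0] - model_base[0], image_base[1] - model_base[1])
    return zeta, translation, scale
```

For example, a unit model spine along x matched to an image spine of length 2 pointing along y yields a rotation of 90°, a scale of 2, and a translation equal to the offset between the base vertices.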

In the discussion to follow we focus on the determination
of $(\phi,\psi)$. The error sensitivity of the other parameters of
the transformation depends mainly on foreshortening and scale,
which are also functions of $(\phi,\psi)$. The effects of viewing
distance and viewing orientation are thus mixed, which leads to
the  idea  of  multi-resolution  model  feature  sets.  We  will  con¬ 
sider  this  issue  in  a  later  section. 
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The  key  issue  is  how  the  vertex-pair  is  projected  as  the  view¬ 
point  moves  over  the  surface  of  the  standard  viewsphere,  which 
is  a  spherical  surface  defined  to  represent  the  possible  viewing 
orientations  of  the  image  plane  with  respect  to  the  object.  The 
object  is  considered  to  be  at  the  origin  of  the  sphere  and  the 
points  on  the  surface  of  the  sphere  define  a  vector  orientation 
from a given point to the origin. This vector corresponds to a
possible viewpoint orientation. A rotation about this viewing
axis will not affect the nature of the projection of the model
into the image plane, since the image plane coordinate axes
can be redefined to compensate. In our approach to assignment
clustering, we find it convenient to represent three dimensional
rotations in terms of sequential rotations about the axes of the
camera coordinate system. However, we can relate any
location on the viewing sphere to the $(\phi, \psi)$ rotational parameter
subspace.

The  optimization  problem  thus  becomes  one  of  selecting 
vertex-pairs  so  that  the  viewing  sphere  is  exhaustively  covered. 
That  is,  for  each  point  on  the  viewing  sphere,  there  should  be 
an  adequate  number  of  potentially  visible  vertex-pairs  which 
have  acceptable  error  performance  and  are  not  degenerately 
viewed for that viewing orientation. Naturally, one cannot
guarantee that a given vertex-pair will be visible, since it may
be occluded by other objects, or by part of the object surface
itself. In our current implementation, the local self occlusion
of a vertex-pair is taken into account by not allowing occluded
viewpoints to appear as solutions in the $(\phi, \psi)$ look-up tables.
We  do  not  solve  the  full  hidden  line  problem  to  determine  global 
occlusion,  although  this  is  desirable  for  a  full  solution  to  the 
optimization  problem. 

Next we consider the definition of an appropriate error
metric which describes the uncertainty in the $(\phi, \psi)$ parameters as
a function of $(\phi, \psi)$ orientation, or equivalently, with respect to
position on the viewsphere.


3  An  Effective  Viewpoint 


3.1  Image  Segmentation  Error 


In  our  current  approach  to  segmentation  [Canny],  we  rely  on 
zero  crossings  in  the  second  derivative  of  image  intensity  to 
define  the  location  of  geometric  edge  and  vertex  features.  There 
are  many  phenomena  that  can  cause  the  location  defined  by 
the  second  derivative  to  be  in  error  with  respect  to  the  ideal 
location  corresponding  to  the  image  projection  of  a  given  object 
feature. Some of the more significant effects are:

•  Complex image intensity behavior that does not
correspond to the simple step edge model employed to detect
object boundaries (e.g. corners and junctions).

•  Boundary characterized by texture, rather than simple
intensity discontinuity.

•  Quantization in image intensity and spatial resolution.

•  Random uncertainty in sensor intensity values (usually a
small effect).


These  effects  contribute  to  uncertainty  in  the  detected 
boundary element locations. In our approach these "edges" are
then  linked,  and  the  resulting  boundary  chain  is  approximated 
by  straight  line  segments  [Asada  and  Brady].  This  process  in¬ 
troduces  additional  error  in  the  segmentation  geometry.  Some 
of  the  major  effects  here  are: 


•  Boundary chains with low curvature do not have well
defined segment endpoint location.

•  Image spatial quantization introduces significant
uncertainty in the chain curvature measurement.


Finally,  even  more  uncertainty  is  introduced  by  the  neces¬ 
sary  extrapolation  of  line  segments  to  form  vertices  between 
endpoints  of  lines  that  should  (ideally)  meet.  This  extension  is 
necessary  because  some  portions  of  the  boundary  are  missing 
due  to  poor  edge  detector  performance  near  junctions  and  in 
the  case  of  low  contrast  boundary  intervals. 

The  cumulative  result  of  these  various  phenomena  in  our 
current  implementation  is  that  edge  orientation  is  accurate  to 
about  5°,  and  vertex  location  is  accurate  to  a  radius  of  sev¬ 
eral  pixels.  These  errors  can  be  much  larger  for  the  case  of 
line  segments  with  lengths  that  are  comparable  to  the  vertex 
uncertainty  (5-10  pixels). 

The  error  in  edge  orientation  is  the  focus  of  our  discussion. 
Next  we  consider  the  relationship  between  this  orientation  error 
and the $(\phi, \psi)$ parameters.


3.2  A  Rotation  Error  Metric 


The equational systems that relate $(\alpha,\beta)$ and $(\phi, \psi)$ are quite
non-linear and present many special solution cases. In our
approach, these equations are solved numerically and stored in
lookup tables. It then becomes reasonable to compute a
Taylor series expansion about a particular value of $(\phi, \psi)$ to provide
a linear expression for the parameter mapping.

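We assume that we have computed the functions, $\alpha(\phi,\psi)$
and $\beta(\phi,\psi)$ (see Figure 1). Then,

$$\alpha(\phi,\psi) = \alpha(\phi_0,\psi_0) + \frac{\partial\alpha}{\partial\phi}(\phi - \phi_0) + \frac{\partial\alpha}{\partial\psi}(\psi - \psi_0) \qquad (1)$$

$$\beta(\phi,\psi) = \beta(\phi_0,\psi_0) + \frac{\partial\beta}{\partial\phi}(\phi - \phi_0) + \frac{\partial\beta}{\partial\psi}(\psi - \psi_0) \qquad (2)$$

where the indicated derivatives can be computed numerically.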

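The Jacobian, $J$, of the parameter mapping is given by,

$$J = \begin{vmatrix} \dfrac{\partial\alpha}{\partial\phi} & \dfrac{\partial\alpha}{\partial\psi} \\[2ex] \dfrac{\partial\beta}{\partial\phi} & \dfrac{\partial\beta}{\partial\psi} \end{vmatrix} \qquad (3)$$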


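Naturally, if $J$ vanishes, then the mapping is not defined for
that particular viewpoint [Whitney]. We can solve the
expansion equations for the variation in $(\phi, \psi)$ as follows:

$$\Delta\phi = \frac{1}{J}\left(\frac{\partial\beta}{\partial\psi}\,\Delta\alpha - \frac{\partial\alpha}{\partial\psi}\,\Delta\beta\right) \qquad (4)$$

$$\Delta\psi = \frac{1}{J}\left(\frac{\partial\alpha}{\partial\phi}\,\Delta\beta - \frac{\partial\beta}{\partial\phi}\,\Delta\alpha\right) \qquad (5)$$

where $\Delta\alpha$ and $\Delta\beta$ denote the deviations of the projected
angles from $\alpha(\phi_0,\psi_0)$ and $\beta(\phi_0,\psi_0)$.

Assuming that the errors in $(\alpha,\beta)$ are independent and of
zero mean, we can derive an estimate for the variance in the
Euclidean distance between the estimate and the mean,
$(\phi_0,\psi_0)$. We denote this squared distance by $\sigma^2_{\phi,\psi}$. A similar
representation can be defined for the variance in the measured
projection angles, i.e. $\sigma^2_{\alpha,\beta}$. Taking the errors in $\alpha$ and $\beta$ to
have equal variance, the ratio of these variances is given by,

$$\frac{\sigma^2_{\phi,\psi}}{\sigma^2_{\alpha,\beta}} = \frac{\left(\frac{\partial\alpha}{\partial\phi}\right)^2 + \left(\frac{\partial\alpha}{\partial\psi}\right)^2 + \left(\frac{\partial\beta}{\partial\phi}\right)^2 + \left(\frac{\partial\beta}{\partial\psi}\right)^2}{2J^2}$$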


The ratio, $\sigma^2_{\phi,\psi}/\sigma^2_{\alpha,\beta}$, is a measure of the sensitivity of the
computed model rotation parameters to errors in the segmentation.
If this ratio is much larger than one, the variation in $(\phi, \psi)$
becomes too great to provide an effective filter for incorrect model
assignments. Typically, it is necessary to bound a valid $(\phi, \psi)$
transformation cluster with a tolerance radius of approximately
5°, in order to eliminate false assignment hypotheses. Thus, a
ratio of one, or less, is necessary to support the application of
viewpoint consistency with respect to the rotation parameters,
$(\phi, \psi)$.
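The rotation error metric can be evaluated numerically along the following lines. This sketch assumes independent, zero-mean errors of equal variance in the two projected angles, in which case inverting the linearized mapping gives a variance ratio equal to the sum of the four squared partial derivatives divided by $2J^2$. The function name, the central-difference step, and the callable interface are assumptions made for illustration.

```python
import math

def error_ratio(angles, phi, psi, h=1e-4):
    """Estimate the variance ratio at (phi, psi).

    angles: callable (phi, psi) -> (alpha, beta), in radians.
    Partial derivatives are taken by central differences with step h.
    Returns math.inf where the Jacobian determinant vanishes.
    """
    a_hi, b_hi = angles(phi + h, psi)
    a_lo, b_lo = angles(phi - h, psi)
    da_dphi = (a_hi - a_lo) / (2.0 * h)
    db_dphi = (b_hi - b_lo) / (2.0 * h)

    a_hi, b_hi = angles(phi, psi + h)
    a_lo, b_lo = angles(phi, psi - h)
    da_dpsi = (a_hi - a_lo) / (2.0 * h)
    db_dpsi = (b_hi - b_lo) / (2.0 * h)

    jac = da_dphi * db_dpsi - da_dpsi * db_dphi
    if jac == 0.0:
        return math.inf  # degenerate viewpoint: mapping not invertible

    total = da_dphi ** 2 + da_dpsi ** 2 + db_dphi ** 2 + db_dpsi ** 2
    return total / (2.0 * jac ** 2)
```

As a sanity check, a mapping in which the projected angles equal the rotation angles gives a ratio of one, and a mapping twice as sensitive gives one quarter. Angle functions built on an arctangent can jump by a full turn near ±180°, so a practical version would need to unwrap the differences first.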

3.3  Effective  Viewing 

We  are  now  in  a  position  to  define  the  concept  of  effective 
viewing.  An  effective  vertex-pair  is  one  which  provides  a  pre¬ 
cise  estimate  of  the  coordinate  transform  between  the  three  di¬ 
mensional  reference  frame  and  the  image  projection  reference 
frame. 

A single vertex-pair cannot be effective over all viewpoints.
There are obvious degenerate viewing conditions where the
transformation,

$$(\alpha,\beta) \rightarrow (\phi,\psi)$$

is not defined. For example, when the viewing direction is
collinear with the spine, the corresponding transform equations
do not have a solution.

The error metric that we have just defined predicts these
degeneracies as illustrated in Figure 10. In this figure, we have
projected $\sigma^2_{\phi,\psi}/\sigma^2_{\alpha,\beta}$ onto the viewing sphere. In the left column, a
coplanar vertex-pair is shown in the upper pane. The spine is
horizontal and the displayed viewpoint is somewhat above the
plane of the vertex-pair. One of the edges is quite foreshortened
at this viewpoint. The viewsphere for this case is shown in the
lower left pane. The eyepoint for this image of the sphere is
the same as that of the vertex-pair in the upper pane. The
intensity in this image is proportional to the ratio,
$\sigma^2_{\phi,\psi}/\sigma^2_{\alpha,\beta}$. The
error metric becomes high at the equator of the viewsphere,
since this corresponds to the locus of viewpoints that lie in the
plane of the vertex-pair. The edges and spine of the
vertex-pair all collapse into a single line for this set of viewpoints.
Singularities exist at the points where the view axis is collinear with
any of the edges or the spine.

There is also a region of relatively high error at the poles of
the viewsphere. These viewpoints correspond to looking
normal to the plane of the vertex-pair. This is reasonable, since
the projected angles of edges lying in a plane have the least
variation with tip and tilt of the plane when the viewpoint is
normal to the plane.

Figure 10: Two examples of the error metric projected onto the
viewing sphere. The upper panes show the vertex-pairs used
to generate the spheres shown in the lower panes.

The  right  column  shows  the  same  calculation  except  that 
the  vertex-pair  is  not  coplanar.  The  first  edge  is  now  inclined 
upward  out  of  the  horizontal  plane  at  45°.  There  are  still  two 
great  circles  of  high  value  on  the  viewsphere.  These  correspond 
to  viewpoints  lying  in  the  two  planes  defined  by  the  spine  and 
each  of  the  edges.  The  error  associated  with  normal  views 
of  each  of  these  planes  now  contributes  to  a  rather  complex 
distribution  over  the  sphere.  The  dark  areas  on  the  sphere 
correspond  to  effective  viewpoints. 

4  The  Composition  of  Multiple  Vertex-Pairs

4.1  Independent  Segmentation  Error 

If we assume that the variation in the projected angles of
vertex-pairs is due to statistically independent segmentation
error, then a composite error metric for a set of vertex-pairs
can be defined in terms of $\sigma^2_{\alpha,\beta}$, the expected variance in the
angles associated with a particular vertex-pair in the
segmentation. In general,
the variance of edge angles in the segmentation is not directly
related to the associated model vertex-pair. The uncertainty of
angle determination between image boundary segments is
inversely proportional to edge length. The length is controlled by
occlusion and edge contrast effects, and is not closely related
to the projected edge length of the ideal object edge.

Therefore  we  can  simplify  the  composition  process  by  as¬ 
suming  that  the  variance  is  the  same  for  all  vertex-pairs 
grouped  from  the  segmentation,  and  further  that  this  variance 
is  the  worst  case  value  corresponding  to  the  shortest  accept¬ 
able  segmentation  edges.  The  resulting  composite  error  metric 
is  simply  the  average  of  the  error  metric  for  the  constituent 
vertex-pairs. 
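Under the equal-variance simplification, the composite map is a point-by-point average, which can be sketched as below. The representation of an error map as a list with one value per viewsphere sample is an assumption made for this example.

```python
def composite_error(error_maps):
    """Average several per-vertex-pair error maps sample by sample.

    error_maps: non-empty list of equal-length lists; entry k of each list
    is that vertex-pair's error metric at viewsphere sample k.
    """
    count = len(error_maps)
    return [sum(values) / count for values in zip(*error_maps)]
```

A single poorly behaved vertex-pair therefore raises the composite everywhere it has high error, which is the behavior discussed for the example in the next section.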

4.2  An  Example 

Earlier  this  year  we  carried  out  an  experiment  with  infra-red 
targeting  data  taken  from  the  air  attack  by  the  U.S.  on  Libya. 
These  images  were  digitized  from  a  video  tape  and  are  of  rather 
low  contrast  and  have  significant  random  noise  effects.  An  ex¬ 
ample  of  the  match  of  a  model  of  the  Russian  IL-76  transport 
aircraft  is  shown  in  Figure  11.  The  model  is  shown  superim¬ 
posed  on  the  image  data. 

We have repeated this experiment to observe how the
effective viewing metric behaves for the individual vertex-pairs that
were selected in the model. Figure 12 shows the error
projection for a well-behaved vertex-pair. The usual degeneracy and
error distribution is evident. In Figure 13 the error display
indicates that this vertex-pair is almost completely useless. There
is only a small portion of the viewsphere that provides an
effective viewpoint. The main reason that this vertex-pair is so
ineffective is that one of the edges is nearly collinear with the
spine vector. In this configuration, the projected angle for that
edge does not vary significantly for a broad range of rotations
about the spine. Thus, the variance in $(\phi, \psi)$ is large for large
regions of the viewsphere.

Figure 11: A match on infra-red image data. The composite
error metric is shown projected on the viewing sphere.

The composite error map in Figure 11 shows that the
composite is still effective although the general variance level has
been increased over large areas of the sphere. The composite
is computed for five vertex-pairs that were manually selected
from the model. This result indicates that the match
robustness is probably low and that we were somewhat fortunate to
get a unique match in this experiment.

Figure 12: A well behaved vertex-pair and its associated error
sphere.


Figure  13:  A  poorly  behaved  vertex-pair  and  its  associated 
error  sphere,  showing  large  areas  of  high  error. 

5  Automatic  Vertex-Pair  Selection 

5.1  Computing  the  Optimal  Vertex-Pair  Set 

A  major  application  for  the  vertex-pair  error  metric  is  in  auto¬ 
matic  vertex-pair  selection.  Given  a  large  number  of  vertices  in 
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the  model,  it  is  necessary  to  avoid  generating  a  correspondingly 
large  number  of  model  vertex-pairs.  In  particular,  it  would  be 
desirable  to  have  sufficient  vertex-pairs  defined  so  as  to  provide 
some necessary number of vertex-pairs with good local error
behavior for every possible viewpoint. We want to minimize the
total number of model vertex-pairs, $M$, because the complexity
of the algorithm is of order $M(N-1)^2$ (sec. 1.3).

In general, to generate all possible vertex-pairs in the model
requires pairing each vertex with every other vertex. The only
characteristic of the vertex-pair geometry whose effect on error
can currently be evaluated for a general viewpoint is the pair of
angles between the spine and the edges. If either of the edges
is nearly collinear with the spine, the error value will be high
over a wide range of viewpoints. By limiting the incident angle
at the base vertex, we can reduce the number of model
vertex-pairs generated from $n^2 - n$ to $c(n^2 - n)$, where $c$ is the fraction
of $(0, \pi)$ allowed as an incident angle and $n$ is the number of
model vertices.
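The incident-angle filter can be sketched as follows. The vertex representation (a 3D position plus a list of edge direction vectors) and the symmetric margin `min_angle` at both ends of the range are assumptions for illustration. With a margin of this form the allowed fraction is $c = (\pi - 2\,\mathrm{min\_angle})/\pi$, and the count $c(n^2 - n)$ holds only on average.

```python
import math
from itertools import permutations

def angle_between(u, v):
    """Unsigned angle in [0, pi] between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))

def candidate_pairs(vertices, min_angle):
    """Return ordered (base, other) pairs whose base edges all make an
    angle with the spine inside [min_angle, pi - min_angle].

    vertices: list of (position, [edge_direction, ...]) tuples.
    """
    kept = []
    for i, j in permutations(range(len(vertices)), 2):
        base_pos, base_edges = vertices[i]
        other_pos = vertices[j][0]
        spine = tuple(b - a for a, b in zip(base_pos, other_pos))
        if all(min_angle <= angle_between(edge, spine) <= math.pi - min_angle
               for edge in base_edges):
            kept.append((i, j))
    return kept
```

Pairs whose spine runs along one of the base edges are rejected outright, since they would show high error over a wide range of viewpoints.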

The proposed next step in generating an active set of model
vertex-pairs is ranking the $c(n^2 - n)$ model vertex-pairs
generated above according to the surface area of the viewsphere
in which they demonstrate low error. First, we apply a simple
local visibility criterion to determine the area of the viewsphere
within which each vertex-pair will be visible (the intersection
of outside half spaces of the faces incident on the vertex-pair).
Then we may compute error maps for the remaining areas of
the viewsphere for each vertex-pair. The surface area of the
viewsphere which is visible and of low error for a model
vertex-pair is the value of that vertex-pair for the purpose of active
set selection.

Finally,  we  may  select  vertex-pairs  by  searching  linearly 
through  the  ordered  set  of  candidates  until  all  of  the  points 
on  the  viewsphere  are  addressed  by  some  minimal  number  of 
vertex-pairs  or  we  run  out  of  vertex-pairs.  This  will  result  in 
the  optimal  group  of  model  vertex-pairs  being  selected.  If  we 
run  out  of  model  vertex-pairs  before  completely  covering  the 
viewsphere, those remaining areas of the viewsphere are still
covered  by  vertex-pairs  which  exhibit  high  error  there. 
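The ranking and linear search might look like the sketch below. It assumes the viewsphere has been sampled at a fixed set of points, that each candidate is described by the set of sample indices where it is visible with low error, and that a candidate is skipped when it adds nothing to an under-covered sample. These are illustrative choices, and all names are hypothetical.

```python
def select_vertex_pairs(coverage, num_samples, min_cover=1):
    """Select vertex-pairs in order of decreasing covered area.

    coverage: dict mapping a vertex-pair name to the set of viewsphere
    sample indices where that pair is visible and has low error.
    Returns (selected names, sample indices still covered fewer than
    min_cover times).
    """
    # Rank by covered area, using the number of samples as a proxy.
    ranked = sorted(coverage, key=lambda name: len(coverage[name]), reverse=True)
    counts = [0] * num_samples
    selected = []
    for name in ranked:
        if all(c >= min_cover for c in counts):
            break  # every viewpoint already has enough vertex-pairs
        if any(counts[i] < min_cover for i in coverage[name]):
            selected.append(name)
            for i in coverage[name]:
                counts[i] += 1
    uncovered = [i for i, c in enumerate(counts) if c < min_cover]
    return selected, uncovered
```

Any indices returned in the second list correspond to the regions of the viewsphere that can only be served by vertex-pairs with high error there.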

5.2  Using  High  Error  Vertex-Pairs 

Certain  viewing  conditions  require  the  use  of  high  error  valued 
vertex  pairs.  These  would  occur,  for  example,  when  looking  at 
a  model  from  a  viewpoint  such  that  all  visible  surfaces  lie  on  a 
plane orthogonal to the view axis (e.g. the top of a cube). In
this  case,  only  coplanar  vertex-pairs  would  be  visible,  and  they 
would  be  viewed  from  a  point  on  the  viewsphere  where  their 
error  value  is  known  to  be  high.  However,  although  the  only 
matching  features  available  have  a  high  error  value,  they  may 
still  return  the  correct  transform  values. 

High error means a vertex-pair is unable to precisely
determine the $(\phi, \psi)$ rotation at that viewpoint. It will therefore
propose a number of possible solutions $(\phi, \psi)$ for a given $(\alpha, \beta)$.
Areas matched with high error vertex-pairs will therefore be
more liable to produce false positive responses.

6  Conclusion 

We have defined an error metric that quantifies the matching
precision of the vertex-pair match feature as a function of
viewpoint and geometry. With this tool, we were able to define the
concept of an efficient viewpoint, using minimal sets of
demonstrably powerful vertex-pairs to accomplish the matching task.

Our  next  topic  will  be  to  consider  the  role  of  viewpoint 
based  error  in  the  clustering  process  used  to  determine  the 
presence  of  matches.  Currently,  the  precision  of  a  model  vertex- 
pair  is  not  considered  in  transform  clustering.  Imprecise  model 
vertex-pairs  simply  propose  more  solutions.  Each  solution  is 
treated  as  a  separate,  equally  viable  result.  We  believe  the 
error  metric  presented  here  may  be  useful  in  determining  the 
likelihood  that  a  transform  cluster  is  a  false  positive  match, 
resulting  from  the  coincidental  contributions  of  a  number  of 
high  error  valued  vertex-pairs. 

We  will  also  explore  a  more  formal  understanding  of  the 
error  values  of  vertex-pairs  as  a  function  of  geometry.  We  hope 
that  this  may  facilitate  the  generation  of  an  optimal  set  of 
vertex-pairs  for  matching  by  replacing  more  of  the  ranking  by 
error  behavior  with  direct  ranking  by  geometry. 


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Abstract 

We describe an information storage and retrieval mechanism that
avoids the requirement of consistency maintenance imposed by
traditional knowledge-based systems. By viewing data as
opinions rather than facts, the system is able to combine knowledge
that is generally accepted as being true with data that may be
from unreliable sources. We also give a formal account of the
semantics of the approach. The information management system
described has been implemented and used to store information
derived from image processing along with opinions from other
sources about objects in the visible world. It appears to be well
suited for the requirements of an autonomous robot, and for
information storage in general.

1  Introduction 

Database  management  systems  are  widely  used  to  store  and  re¬ 
trieve  facts  about  the  world.  They  provide  a  convenient  and 
efficient  means  to  access  vast  amounts  of  data  and  have  been 
the  basis  of  data  management  in  many  varied  applications.  In 
knowledge-based  systems,  they  have  been  used  to  manipulate  data 
that  encodes  the  knowledge  that  the  system  uses  to  carry  out  its 
activities.  In  these  circumstances  it  has  been  usual  to  impose 
upon  the  data  the  requirement  of  consistency.  The  database  is 
usually  expected  to  be  a  consistent  collection  of  facts  presumed 
to  be  true  about  the  world.  Conflicting  items  of  information 
are  filtered  from  the  database;  else  their  presence  would  result 
in  erroneous  conclusions.  For  this  reason,  great  care  must  be 
taken  to  ensure  that  false  data  are  not  inserted  into  the  database, 
and  much  effort  has  gone  into  developing  procedures  for  ensur¬ 
ing  the  integrity  of  the  data  store.  However,  there  are  numerous 
knowledge-based system applications, autonomous robots being
one, in which the requirement of database consistency is
inappropriate.

Whenever  information  is  provided  by  more  than  one  source  or 
when  the  validity  of  that  information  is  undeterminable,  it  may 
be impossible to satisfy the requirements of consistency. For
example, suppose a friend tells us that the fishing in Lake Mohunk
is great this time of year, and we wish to store this information
in a database. We would first be obliged to ensure that it is not
in conflict with anything already present in the database, and
we would be required to ensure that that remains the case. In
general,  whether  or  not  we  believe  a  statement  is  valid  depends 
on a number of factors:

Credibility of the source — If our friend is an avid fisherman who
just returned from the lake, we should be much more willing
to accept his statement than if the source of information was
hearsay.

Believability  of  the  statement  —  If  Lake  Mohunk  is  the  center 
of  a  fishing  resort  where  fishing  has  been  great  in  the  past, 
we  would  likely  accept  our  friend’s  statement  readily;  if  we 
know  that  the  lake  has  been  subject  to  pollution  we  would 
not. 

Consistency  with  other  statements  —  We  would  be  unwilling  to 
believe  our  friend  if  we  had  polled  six  fishermen  and  all  the 
others  told  us  that  fishing  at  the  lake  is  poor. 

Intended  use  of  the  information  —  If  we  were  just  curious  about 
our  friend's  recent  fishing  trip,  we  would  accept  his  state¬ 
ment,  but  if  we  were  planning  a  two-week  vacation  to  Lake 
Mohunk,  we  might  be  unwilling  to  accept  it  without  addi¬ 
tional  confirmation. 

The  traditional  knowledge-based  system  design  makes  it  difficult 
or  impossible  to  take  all  these  factors  into  account.  In  this  paper 
we  introduce  an  information  storage  system  that  was  designed  to 
overcome  this  limitation. 

2  Core  Knowledge  System 

The  approach  to  database  design  we  describe  in  this  paper  was 
motivated  by  the  need  to  store  information  about  the  visual  world 
in  an  autonomous  robot.  In  particular,  we  have  developed  a  Core 
Knowledge  System  (CKS)  [4]  to  serve  as  a  central  information 
manager  in  a  robot.  A  major  component  of  CKS  is  a  database 
that  encodes  the  spatial  and  semantic  properties  and  relations 
in  the  robot's  environment.  Because  CKS  accepts  data  from  a 
wide  variety  of  sources  (vision  and  range  sensors,  maps,  reason¬ 
ing  systems,  and  even  humans),  it  is  natural  to  expect  that  this 
information  will  be  mutually  inconsistent  and  error-prone. 

The CKS database is a storehouse of the opinions of many
agents. It is not a database of facts. This design has some
unusual properties, properties that allow it to function as a central
repository of information for a community of processes. It also
renders  much  of  the  previous  research  in  database  design  inap¬ 
plicable  and  has  spurred  us  to  develop  new  techniques  for  the 
storage  and  retrieval  of  multiple  opinions. 

The community-of-processes architecture adopted for the CKS
requires that processes be able to communicate their opinions
with one another and without undue interference from processes
with competing views. The formalization and use of an opinion
base in lieu of a database gives rise to the following important
CKS features:

•  The  ability  to  store  information  that  is  inconsistent. 

•  The  ability  to  integrate  multiple  opinions  and  to  allow  that 
integration  to  depend  upon  the  intended  use  of  the  informa¬ 
tion. 

•  The  ability  to  separate  one  source’s  opinions  from  another’s. 

These  goals  have  been  achieved  while  retaining  an  ability  to 
incorporate  general  knowledge  that  is  universally  accepted  as  be¬ 
ing  true.  An  opinion  base  appears  to  us  to  be  a  better  model  of 
how  humans  store  information  than  is  a  database  of  consistent 
facts. 

3  Logical  Interpretation  of  the  CKS 
Database 

Because  of  the  presence  of  inconsistent  information,  first-order 
logic  is  insufficient  for  describing  the  semantics  of  the  database. 

In  what  follows  we  will  resort  to  a  modal  logic  of  belief  as  a 
means  for  interpreting  CKS  transactions.  This  logic  is  extremely 
expressive  and  could  be  used  to  specify  a  much  richer  collection 
of  knowledge.  However,  we  have  purposely  limited  the  scope  of 
our logic as a concession to efficiency of implementation. We are,
after all, designing the CKS as the core component of an
autonomous vehicle, and implementation issues cannot be ignored.
For  this  reason,  the  ultimate  design  is  a  compromise  between 
the  ability  to  express  complicated  statements  involving  beliefs  of 
multiple  agents  and  the  ability  to  retrieve  relevant  information 
quickly.  Throughout  the  design,  we  have  been  guided  by  the  re¬ 
quirements  of  autonomous  vehicles  and  have  adopted  what  we 
feel  to  be  an  efficient  implementation  that  does  not  sacrifice  the 
ability  to  express  the  information  that  must  be  communicated 
among  the  various  processes  of  an  autonomous  system. 

3.1  Semantics 

Information is stored in the CKS in the form of data tokens. A
data token is a framelike object whose internal structure is related
to the semantic attributes of the object. Among other things, a
data token contains information about the spatial location of an
object and some domain-specific properties that are believed to
be true about the object. For example, a portion of a data token,
token01, might contain:

token01
    :SPATIAL-DESCRIPTION (V1 V2)
    :SEMANTIC-DESCRIPTION (LARGE RED POST)
    :HEIGHT 7.3

Here the spatial description is a list of volume specifications (as
we have previously described [4]) in which the object is presumed
to lie. The semantic description is a list of properties that the
asserting agent believes to be true about token01.

The precise meaning of CKS transactions is specified with the
aid of modal logic. This logic consists of the following
components:

Object Constants — The collection of all data tokens and
potential data tokens in the database.

Function Constants — None.

Predicates  —  The  set  of  vocabulary  words  and  the  set  of  pos¬ 
sible  spatial  descriptions. 

We  adopt  the  usual  syntax  for  forming  well-formed  formulas 
(WFFs)  over  these  symbols  as  well  as  all  the  standard  axioms 
of  first-order  logic.  In  addition,  we  assume  the  existence  of  im¬ 
plicit  axioms  that  allow  us  to  infer  implicit  spatial  relationships 
and  to  make  use  of  semantic  relationships  encoded  in  a  semantic 
network: 

•  $(\forall x)[V_i(x) \rightarrow V_j(x)]$

whenever the volume denoted by $V_i$ is completely contained
within the volume denoted by $V_j$, where $V_i$ and $V_j$ are spatial
descriptions.

•  $(\forall x)[W_i(x) \rightarrow W_j(x)]$

whenever $W_i$ and $W_j$ are connected by a ‘subset’ relation in
the semantic network.

•  $(\forall x)[W_i(x) \rightarrow \neg W_j(x)]$

whenever $W_i$ and $W_j$ are connected by a ‘disjoint’ relation
in the semantic network.
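
The spatial axiom can be illustrated with a small sketch. It assumes, purely for illustration, that a spatial description is an axis-aligned box; the `Box` class, the volume names, and `implied_volumes` are hypothetical and are not part of the CKS.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Hypothetical spatial description: an axis-aligned box."""
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner

    def contains(self, other):
        # True when `other` lies completely inside this box on every axis.
        return all(a <= c and d <= b
                   for a, b, c, d in zip(self.lo, self.hi, other.lo, other.hi))

def implied_volumes(asserted, known):
    """Names of every known volume V_j such that V_i(x) -> V_j(x),
    i.e. the asserted volume V_i is completely contained in V_j."""
    inner = known[asserted]
    return {name for name, box in known.items() if box.contains(inner)}

# Illustrative volumes only.
known = {
    "V1":   Box((0, 0, 0), (1, 1, 1)),
    "ROOM": Box((-5, -5, 0), (5, 5, 3)),
    "YARD": Box((10, 10, 0), (20, 20, 3)),
}
inferred = implied_volumes("V1", known)
```

An agent that asserts V1 for an object is thereby also taken to believe that the object lies in ROOM, but nothing follows about YARD.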

We employ a modal operator $B_i$, to be interpreted as meaning
“agent i believes that,” so that $B_i[\Phi]$ is interpreted as “Process i
believes $\Phi$ is true.” This operator is similar to that described by
Moore  [3]  and  by  Konolige  [2].  However,  our  axiomatization  is 
different,  as  we  only  develop  here  those  formulas  that  are  needed 
to  interpret  the  action  of  the  database. 

The following axiom schema provides the modal operator with
the semantics we desire:

•  $(\Phi \rightarrow \Psi) \Rightarrow (B_i[\Phi] \rightarrow B_i[\Psi])$

That is, the modal operator $B_i$ is closed under deductive inference.

Significantly, this axiomatization does not require that
$(B_i[\Phi] \lor B_i[\neg\Phi])$ is true, nor does it require that
$(B_i[\Phi] \land B_i[\neg\Phi])$ is false. However, it does require
$(B_i[\Phi] \lor \neg B_i[\Phi])$ to be true and
$(B_i[\Phi] \land \neg B_i[\Phi])$ to be false. This arrangement allows
conflicting beliefs to exist without corrupting the database.

For any proposition $\Phi$, an agent may have any of four states
of belief, which are listed below. An example is given of an
English statement that would be encoded by each of those states
of belief. In the example, $\Phi = House(object)$, and
$\forall x\,[Car(x) \Rightarrow \neg House(x)]$ is included in the semantic network:

•  “The object is a house.”  $(B_i[\Phi] \land \neg B_i[\neg\Phi])$

•  “The object is a house and it's also a car.”  $(B_i[\Phi] \land B_i[\neg\Phi])$

•  “The object is blue.”  $(\neg B_i[\Phi] \land \neg B_i[\neg\Phi])$

•  “The object is a car.”  $(\neg B_i[\Phi] \land B_i[\neg\Phi])$
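
The four belief states can be sketched as a small classifier. This is a minimal illustration, not the CKS implementation: the `BeliefState` names, the `DISJOINT` table, and `belief_state` are assumptions, and only 'disjoint' links are consulted ('subset' links are ignored for brevity).

```python
from enum import Enum

class BeliefState(Enum):
    BELIEVES = "B[phi] and not B[not phi]"
    CONFLICTED = "B[phi] and B[not phi]"
    NO_OPINION = "not B[phi] and not B[not phi]"
    DISBELIEVES = "not B[phi] and B[not phi]"

# Hypothetical 'disjoint' links from the semantic network: Car(x) => not House(x).
DISJOINT = {frozenset({"HOUSE", "CAR"})}

def belief_state(asserted, word):
    """Classify one agent's belief about word(object), given the
    vocabulary words that agent asserted about the object."""
    believes = word in asserted
    # A negative belief is never asserted; it is deduced from a disjoint link.
    believes_not = any(frozenset({word, other}) in DISJOINT for other in asserted)
    if believes and believes_not:
        return BeliefState.CONFLICTED
    if believes:
        return BeliefState.BELIEVES
    if believes_not:
        return BeliefState.DISBELIEVES
    return BeliefState.NO_OPINION

state_house = belief_state({"HOUSE"}, "HOUSE")
state_both = belief_state({"HOUSE", "CAR"}, "HOUSE")
state_blue = belief_state({"BLUE"}, "HOUSE")
state_car = belief_state({"CAR"}, "HOUSE")
```

The four calls correspond, in order, to the four English statements listed above.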


3.2  Insertions

An  insertion  is  of  the  form 

(INSERT-DATA-TOKEN token01)

where token01 is an instance of a data token:

token01
    :SPATIAL-DESCRIPTION (V1 V2)
    :SEMANTIC-DESCRIPTION (LARGE RED POST)
    :HEIGHT 7.3

Executing (INSERT-DATA-TOKEN token01) has the same effect
as asserting

$B_i[V_1(\text{token01})] \land B_i[V_2(\text{token01})] \land B_i[LARGE(\text{token01})] \land B_i[RED(\text{token01})] \land B_i[POST(\text{token01})]$
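
The expansion of an insertion into a conjunction of beliefs can be sketched as follows. The function name `beliefs_from_token` and the tuple encoding of a belief atom are assumptions made for illustration only.

```python
def beliefs_from_token(agent, token_id, spatial, semantic):
    """Return the belief atoms asserted by inserting one data token.

    Each atom (agent, predicate, token_id) stands for B_agent[predicate(token_id)].
    The insertion as a whole is the conjunction of all returned atoms.
    """
    predicates = list(spatial) + list(semantic)
    return [(agent, predicate, token_id) for predicate in predicates]

# The token01 example from the text, inserted by a hypothetical agent "i".
atoms = beliefs_from_token("i", "token01", ["V1", "V2"], ["LARGE", "RED", "POST"])
for agent, predicate, token in atoms:
    print(f"B_{agent}[{predicate}({token})]")
```

Because only a list of atoms can be produced, a disjunction has no representation in this encoding, which mirrors the restriction to conjunctions of properties.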

Assertions about objects can only be in the form of a
conjunction of properties. This approach allows an agent to assert
that token02 is a GREEN HOUSE but does not allow him to say
that token02 is either a HOUSE or a BARN. If an
agent desires to convey this information, he must rephrase it as,
for example, token02 is a BUILDING, with the attendant loss
of information. If it is truly important for a process to convey
this disjunction precisely, the vocabulary should be extended to
include HOUSE-OR-BARN as an acceptable term. It is our
expectation that processes will need to communicate only in terms
similar to those that have evolved for human communication, and
as a result, that there will be little need to resort to such artificial
constructs as HOUSE-OR-BARN.

3.3  Queries

The syntax of the query language is provided in Appendix A.
The language differs from traditional query languages because
the CKS database contains information that is both incomplete
and inconsistent. The query language must provide the user
with a means for discerning multiple opinions. For this reason,
the query language qualifiers APPARENTLY and POSSIBLY are
provided to make these distinctions. Loosely speaking, a WFF is
true “apparently” if there is some agent that believes it. It is
true “possibly” if there is some agent who believes it or there
is no agent that believes that it is false. These notions will be
formalized shortly.

As can be seen in Appendix A, queries are either simple or
compound. Simple queries are a list of three items: a qualifier, a
relation, and an argument. The argument is a semantic or spatial
description or something else, depending upon the identity of the
relation. Each relation is in reality a three-valued predicate, with
“I don't know” being an acceptable value. This allows the user
of the CKS to reason appropriately when information is either
lacking or inconsistent. Currently four relations are provided:
IN, IS, HAS-AS-PART, and IS-PART-OF (see Appendix A).
Additional relations may also be defined for use in retrievals;
this facility is described in Section 3.4.

Each relation divides the universe of objects into three sets
relative to its argument(s): those that are known to satisfy the
relation, those that are known to not satisfy the relation, and those
for which no determination is possible. The qualifiers
“apparently” and “possibly” are used to indicate which sets are desired
in any particular query. To describe the semantics of qualified
queries, there are two cases to consider: those queries for which
only a single agent has provided a relevant opinion, and those for
which more than one agent is involved.

First, we enumerate the situations in which there is only a
single agent. $\Phi$ is “apparently” true for the following combinations
of beliefs of an agent.

APPARENTLY (rows: belief about $\neg\Phi$; columns: belief about $\Phi$)

                          |  $B_i[\Phi]$  |  $\neg B_i[\Phi]$
  $\neg B_i[\neg\Phi]$    |  Yes          |  No
  $B_i[\neg\Phi]$         |  Yes          |  No

Similarly, $\Phi$ is “possibly” true for the following combinations of
beliefs.

POSSIBLY (rows: belief about $\neg\Phi$; columns: belief about $\Phi$)

                          |  $B_i[\Phi]$  |  $\neg B_i[\Phi]$
  $\neg B_i[\neg\Phi]$    |  Yes          |  Yes
  $B_i[\neg\Phi]$         |  Yes          |  No


Before providing exact formulas for the interpretation of CKS
queries, we examine those situations in which $\Phi$ is “apparently”
and “possibly” true when there are two agents involved.

$\Phi$ is “apparently” true for the following combinations of beliefs
of two agents.

APPARENTLY (rows: beliefs of Agent 1; columns: beliefs of Agent 2)

  Agent 1 \ Agent 2                           |  $B_2[\Phi] \land \neg B_2[\neg\Phi]$  |  $B_2[\Phi] \land B_2[\neg\Phi]$  |  $\neg B_2[\Phi] \land \neg B_2[\neg\Phi]$  |  $\neg B_2[\Phi] \land B_2[\neg\Phi]$
  $B_1[\Phi] \land \neg B_1[\neg\Phi]$        |  Yes  |  Yes  |  Yes  |  Yes
  $B_1[\Phi] \land B_1[\neg\Phi]$             |  Yes  |  Yes  |  Yes  |  Yes
  $\neg B_1[\Phi] \land \neg B_1[\neg\Phi]$   |  Yes  |  Yes  |  No   |  No
  $\neg B_1[\Phi] \land B_1[\neg\Phi]$        |  Yes  |  Yes  |  No   |  No

$\Phi$ is “possibly” true for the following combinations of beliefs
of two agents.

POSSIBLY (rows: beliefs of Agent 1; columns: beliefs of Agent 2)

  Agent 1 \ Agent 2                           |  $B_2[\Phi] \land \neg B_2[\neg\Phi]$  |  $B_2[\Phi] \land B_2[\neg\Phi]$  |  $\neg B_2[\Phi] \land \neg B_2[\neg\Phi]$  |  $\neg B_2[\Phi] \land B_2[\neg\Phi]$
  $B_1[\Phi] \land \neg B_1[\neg\Phi]$        |  Yes  |  Yes  |  Yes  |  Yes
  $B_1[\Phi] \land B_1[\neg\Phi]$             |  Yes  |  Yes  |  Yes  |  Yes
  $\neg B_1[\Phi] \land \neg B_1[\neg\Phi]$   |  Yes  |  Yes  |  Yes  |  No
  $\neg B_1[\Phi] \land B_1[\neg\Phi]$        |  Yes  |  Yes  |  No   |  No


By  extrapolating  these  results,  we  are  in  a  position  to  describe 
the  interpretation  of  each  query  in  terms  of  our  modal  logic. 

•  (FIND-IDS  (APPARENTLY  IS  W4) )  returns  a  list  of  those  ob¬ 
jects  in  the  set 

$\{x \mid (\exists i)\, B_i[W_4(x)]\}$,

where  W4  denotes  a  vocabulary  word.  In  English,  this  query 
corresponds  roughly  to  “Find  each  data  token  for  which  some 
agent  believes  W4  is  true  of  it.” 

•  (FIND-IDS  (POSSIBLY  IS  W4))  returns  a  list  of  those  ob¬ 
jects  in  the  set 

$\{x \mid (\exists i)\, B_i[W_4(x)] \lor \neg(\exists i)\, B_i[\neg W_4(x)]\}$.

This  query  can  be  roughly  translated  as  “Find  each  data 
token  for  which  some  agent  believes  W4  could  possibly  be 
true.”  The  set  of  objects  which  satisfy  this  query  will  al¬ 
ways  include  those  objects  that  satisfy  the  APPARENTLY 
version. 

•  (FIND-IDS  (APPARENTLY  IN  V3) )  returns  a  list  of  those  ob¬ 
jects  in  the  set 

$\{x \mid (\exists i)\, B_i[V_3(x)]\}$,

where  V3  is  a  spatial  description.  This  query  can  be  in¬ 
terpreted  as  “Find  each  data  token  such  that  some  agent 
believes  it  is  contained  within  the  volume  V3." 

•  (FIND-IDS  (POSSIBLY  IN  V3) )  returns  a  list  of  those  ob¬ 
jects  in  the  set 

$\{x \mid (\exists i)\, B_i[V_3(x)] \lor \neg(\exists i)\, B_i[\neg V_3(x)]\}$.

This  query  can  be  interpreted  to  mean  “Find  each  data  token 
such  that  some  agent  believes  that  it  could  possibly  be  in  the 
volume  denoted  by  V3." 

•  (FIND-IDS (AND Query1 Query2)) returns a list of those
objects in the set

$\{x \mid x \in \text{(FIND-IDS Query1)} \land x \in \text{(FIND-IDS Query2)}\}$.

In English, this can be translated as “Find each data token
that satisfies both Query1 and Query2.” It can be thought
of as the intersection of the sets of tokens that satisfy each
query.

•  (FIND-IDS (OR Query1 Query2)) returns a list of those
objects in the set

$\{x \mid x \in \text{(FIND-IDS Query1)} \lor x \in \text{(FIND-IDS Query2)}\}$.

This can be translated as “Find each data token that satisfies
either Query1 or Query2.” It can be thought of as the union
of the sets of tokens that satisfy each query.


•  (FIND-IDS (AND-NOT Query1 Query2)) returns a list of those
objects in the set

$\{x \mid x \in \text{(FIND-IDS Query1)} \land x \notin \text{(FIND-IDS Query2)}\}$.

This query can be interpreted to mean “Find all data tokens
that satisfy Query1 but that do not also satisfy Query2.” In
set-theoretic terms, it is the set difference of the sets of data
tokens that satisfy Query1 and Query2.

In  answering  negative  queries,  the  database  has  no  closed- 
world  assumption  [1]  (i.e.,  it  does  not  assume  that  a  proposition 
is  false  if  it  cannot  be  proved  true),  thus  it  avoids  issues  of  non¬ 
monotonicity  and  has  no  need  for  circumscription.  A  negative 
belief  cannot  be  inserted  in  the  database.  However,  a  negative 
belief  can  be  deduced  using  the  deductive  inference  axiom  and 
the general knowledge incorporated in the vocabulary. For example,
a process cannot assert that token03 is not a PINE. But if it
does assert that the token is an OAK (i.e., $B_i[OAK(\text{token03})]$),
the database will infer that $B_i[\neg PINE(\text{token03})]$.
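
The set definitions above can be sketched directly. This is a naive illustration of the semantics, not the CKS retrieval algorithm: the `SUBSET` and `DISJOINT` tables, the agent names, and the nested-dictionary layout of `db` are assumptions, and the universe of objects is taken to be the tokens that appear in `db`.

```python
# Hypothetical fragment of the semantic network.
SUBSET = {"OAK": {"TREE"}, "PINE": {"TREE"}}   # W_i(x) -> W_j(x)
DISJOINT = {frozenset({"OAK", "PINE"})}        # W_i(x) -> not W_j(x)

def closure(words):
    """All words implied by `words` through 'subset' links."""
    implied, frontier = set(words), list(words)
    while frontier:
        for parent in SUBSET.get(frontier.pop(), ()):
            if parent not in implied:
                implied.add(parent)
                frontier.append(parent)
    return implied

def believes(asserted, word):
    """B_i[word(x)]: the word was asserted or follows by deduction."""
    return word in closure(asserted)

def believes_not(asserted, word):
    """B_i[not word(x)]: a held word is disjoint from the word or one of its ancestors."""
    held = closure(asserted)
    return any(frozenset({a, h}) in DISJOINT
               for a in closure({word}) for h in held)

def find_ids(db, qualifier, word):
    """db maps agent -> {token id -> set of asserted vocabulary words}."""
    tokens = set().union(*(opinions.keys() for opinions in db.values()))
    apparent = {t for t in tokens
                if any(believes(op.get(t, set()), word) for op in db.values())}
    if qualifier == "APPARENTLY":
        return apparent
    denied = {t for t in tokens
              if any(believes_not(op.get(t, set()), word) for op in db.values())}
    return apparent | (tokens - denied)        # POSSIBLY

db = {
    "vision": {"token03": {"OAK"}, "token04": {"PINE"}},
    "map":    {"token04": {"OAK"}, "token05": {"LARGE"}},
}
apparently_pine = find_ids(db, "APPARENTLY", "PINE")
possibly_pine = find_ids(db, "POSSIBLY", "PINE")
only_possibly = possibly_pine - apparently_pine   # AND-NOT as set difference
```

Here token04 is both believed and disbelieved to be a PINE, so it satisfies both qualifiers; token03 is excluded from the POSSIBLY result because asserting OAK implies a belief that it is not a PINE. Compound queries map onto the set operators `&`, `|`, and `-`.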

3.4  User-Defined  Relations 

In addition to the predefined relations, a new relation can
be defined by providing a LISP function as the relation in a
query. It may take any arguments that are appropriate for
it, but must return T, NIL, or :MAYBE. For example, a relation
WEIGHS-MORE-THAN could be defined as follows.

(defun WEIGHS-MORE-THAN (token ref-weight)
  ;; Use the most recently supplied opinion of the token's weight.
  (let ((weight (RETRIEVE-SLOT-OPINION token :WEIGHT 'LATEST)))
    (cond ((null weight) :MAYBE)        ; no opinion: cannot decide
          ((> weight ref-weight) T)
          (T NIL))))

Programmers  who  write  procedural  relations  are  cautioned  to 
write  them  efficiently.  While  the  transaction  parser  will  attempt 
to  optimize  the  query  evaluation,  nevertheless  it  will  sometimes 
be  forced  to  evaluate  the  relation  on  a  large  list  of  data  tokens. 

The qualifier is used to interpret the results of the three-valued
predicate’s  evaluation  of  each  data  token  in  order  to  construct  the 
list  of  tokens  that  are  to  be  returned: 

APPARENTLY  —  The  list  returned  contains  only  those  tokens 
for  which  the  function  evaluates  to  T. 

POSSIBLY — The list includes all those tokens for which the
function evaluates to T or :MAYBE.
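
The way a qualifier filters the results of a three-valued relation can be sketched as below. The `MAYBE` sentinel, the dictionary representation of tokens, and this simplified `find_ids` are assumptions for illustration; they do not reproduce the CKS transaction parser.

```python
MAYBE = "MAYBE"  # stand-in for the :MAYBE keyword

def weighs_more_than(token, ref_weight):
    """Three-valued relation: True, False, or MAYBE when no weight is known."""
    weight = token.get("WEIGHT")
    if weight is None:
        return MAYBE
    return weight > ref_weight

def keep(result, qualifier):
    """Decide whether a token with this relation result is returned."""
    if qualifier == "APPARENTLY":
        return result is True
    if qualifier == "POSSIBLY":
        return result is True or result is MAYBE
    raise ValueError(f"unknown qualifier: {qualifier}")

def find_ids(tokens, qualifier, relation, *args):
    """Evaluate the relation on every token and filter by the qualifier."""
    return [tid for tid, tok in tokens.items()
            if keep(relation(tok, *args), qualifier)]

# Illustrative tokens: t3 has no WEIGHT opinion at all.
tokens = {"t1": {"WEIGHT": 12.0}, "t2": {"WEIGHT": 3.0}, "t3": {}}
apparently = find_ids(tokens, "APPARENTLY", weighs_more_than, 5.0)
possibly = find_ids(tokens, "POSSIBLY", weighs_more_than, 5.0)
```

The token with no weight opinion is returned only under POSSIBLY, since the relation can neither confirm nor rule it out.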

For  the  standard  relations,  the  above  description  indicates  the 
effect  of  a  query,  but  not  the  implementation.  The  actual  algo¬ 
rithm  used  for  the  standard  relations  is  much  different  in  order 
to  achieve  high  performance  on  very  large  databases. 

3.5  Discussion 

There  are  several  restrictions  upon  the  statements  that  an  agent 
can  make  about  the  world  and  on  the  types  of  queries  that  can 
be  posed.  These  restrictions  were  necessary  to  enable  a  practical 
implementation  of  the  database. 

The  query  language  only  allows  a  limited  variety  of  queries. 
Acceptable  queries  are  limited  both  by  their  syntax  and  by  the 
vocabulary  of  properties.  The  query  language  is  not  intended  to 
be a universal language. We have designed it so that the only
queries that can be posed are those that can usually be retrieved
efficiently, given our database architecture. It is important to
bear in mind that the limitation of the query language is one
that restricts only what questions can be answered efficiently;
it does not prevent the identification of data tokens that satisfy
an unusual query. When faced with a question that cannot be
posed as a syntactically legal query, a user can obtain the exact
retrieval by first retrieving a superset of the desired data tokens
with an acceptable query, and then examining each token in that
set individually for satisfaction of the intended query.
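
The superset-then-filter strategy can be sketched as follows. The names `find_exact`, `is_post`, and `taller_than_wide`, the WIDTH slot, and the token layout are hypothetical; the first predicate merely stands in for a legal query such as (APPARENTLY IS POST).

```python
def find_exact(tokens, legal_query, intended):
    """Retrieve a superset with a query the language can express,
    then test each candidate individually against the intended condition."""
    superset = [tid for tid, tok in tokens.items() if legal_query(tok)]
    return [tid for tid in superset if intended(tokens[tid])]

# Illustrative data tokens.
tokens = {
    "t1": {"words": {"POST", "RED"}, "HEIGHT": 7.3, "WIDTH": 0.5},
    "t2": {"words": {"POST"}, "HEIGHT": 0.4, "WIDTH": 0.6},
    "t3": {"words": {"HOUSE"}, "HEIGHT": 20.0, "WIDTH": 30.0},
}

def is_post(tok):
    # Stand-in for the legal query that yields the superset.
    return "POST" in tok["words"]

def taller_than_wide(tok):
    # The intended condition, which compares two slots of the same token.
    return tok["HEIGHT"] > tok["WIDTH"]

tall_posts = find_exact(tokens, is_post, taller_than_wide)
```

The cost of the second pass grows with the size of the superset, so the acceptable query should be chosen to be as selective as possible.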

4  Slot  Access 

The  query  language  described  in  Section  3  provides  the  means  to 
insert  data  tokens  into  the  CKS  database  and  to  retrieve  pointers 
to  those  tokens  based  upon  spatial  and  semantic  criteria.  In 
this  section,  the  various  mechanisms  for  gaining  access  to  the 
information  contained  within  a  data  token  are  described. 

Data  tokens  are  stored  as  frames  consisting  of  a  number  of 
slots. Externally, each slot has a single value. Internally, however,
a  separate  value  is  maintained  for  each  process  that  offers  an 
opinion.  Conceptually,  an  association  list  that  pairs  the  process 
name  with  its  opinion  is  stored  for  each  slot. 

SLOT-NAME: ((P1 . V1) (P2 . V2) ...)

When retrieving a slot's value, one specifies the combination
method  desired  to  combine  all  values  that  have  been  previously 
provided  for  that  slot.  This  approach  makes  it  possible  to  inte¬ 
grate  multiple  opinions  in  a  manner  that  is  suited  to  the  task  at 
hand.  For  example,  if  a  robot  wants  to  determine  whether  its 
camera  will  have  an  unobstructed  view  over  a  fence,  it  should 
use  that  opinion  that  has  the  greatest  value  for  the  HEIGHT  slot. 
On  the  other  hand,  if  its  goal  is  to  keep  all  the  cows  in  a  confined 
area, it should be interested in the smallest value of HEIGHT.

The following function is used to store a new opinion as the
value of a slot.

(INSERT-SLOT-OPINION <id> <slot> <value>
                     &optional <auxiliary-data-fields>)

INSERT-SLOT-OPINION causes <value> to be stored as the
opinion of the calling process for the <slot> of the data token denoted
by <id>. If the process has already provided an opinion, it is
replaced by <value>. The strength of belief and time of belief
may be specified in <auxiliary-data-fields>. If present, this
information is stored in the internal representation and can be
retrieved with RETRIEVE-SLOT-OPINION. If <slot> is not the
name of a currently known slot in <id>, a new slot is created,
which provides the facility for a user to extend a data token so
as to include a slot of his own choosing.

Slot values are retrieved from data tokens using the following
function.

(RETRIEVE-SLOT-OPINION  <id>  <slot>  <arbitrator>) 

RETRIEVE-SLOT-OPINION returns two values. The first can be
viewed as the value of <slot> for data token <id>. The particular
value returned is determined by <arbitrator>, which
specifies how multiple opinions are to be integrated by this
invocation of RETRIEVE-SLOT-OPINION. The second value contains
the auxiliary data associated with the returned opinion. The
<arbitrator> can be any from the following list, or a programmer
can use a special-purpose procedure by supplying that
procedure as the value of <arbitrator>. Additional arbitrators may
be added in the future to support a larger variety of techniques
for information integration.

The functionality of each choice is as follows:

DEFAULT  returns  the  default  value  stored  in  the  semantic  net¬ 
work,  and  possibly  inherited  from  a  superclass.  This  default 
can  be  considered  as  another  opinion  of  the  value  of  the  slot 
on  the  object  denoted  by  <id>. 

LATEST  returns  the  opinion  that  was  most  recently  provided  to 

the  CKS. 

MIN returns the smallest of all the opinions. MIN uses arithmetic
comparison  if  its  arguments  are  numeric,  uses  alphabetic 
comparison  for  arguments  that  are  strings  or  symbols,  and 
is  undefined  otherwise. 

MAX  returns  the  largest  of  all  the  opinions. 

AVG returns the arithmetic average of all the opinions. It ignores
any opinion that is nonnumeric.

(PROCESS  <proc-name>)  returns  the  opinion  that  was  most  re¬ 
cently  provided  by  the  process  denoted  by  <proc-name>. 

LIST  returns  a  list  of  all  opinions  that  have  been  rendered. 

ALIST returns an alist of all the opinions. Each pair is of the form
(<proc-name> . <value>), where <proc-name> is the name
of  a  process,  and  <value>  is  the  most  recent  opinion  pro¬ 
vided  by  that  process. 

<procedure>  is  the  name  of  a  function  or  a  lambda  expression 
provided  by  the  programmer.  Its  single  argument  is  the  re¬ 
sult  that  would  have  been  returned  if  ALIST  had  been  used. 
<procedure>  should  return  a  value  that  is  to  be  viewed  as 
the integration of all opinions. For example, a simple
opinion preference scheme could be implemented by
(RETRIEVE-SLOT-OPINION <id> <slot> 'PRIORITY),
where PRIORITY is defined as


(defun PRIORITY (alist)
  ;; Prefer Process-1's opinion, then Process-2's, then the first one listed.
  (or (cdr (assq 'Process-1 alist))
      (cdr (assq 'Process-2 alist))
      (cdar alist)))
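
The arbitrators described above can be sketched over a single slot's association list. This is an illustrative model only: the list is assumed to be ordered oldest opinion first, the DEFAULT arbitrator (which needs the semantic network) and the second return value (auxiliary data) are omitted, and the process names and heights are made up.

```python
from statistics import mean

# Opinions for one HEIGHT slot, oldest first: (process name, value).
height_opinions = [("stereo", 1.4), ("laser", 1.1), ("operator", 1.25)]

def retrieve_slot_opinion(opinions, arbitrator):
    """Integrate the opinions for one slot as directed by the arbitrator."""
    values = [value for _, value in opinions]
    if arbitrator == "LATEST":
        return values[-1]
    if arbitrator == "MIN":
        return min(values)
    if arbitrator == "MAX":
        return max(values)
    if arbitrator == "AVG":
        # Nonnumeric opinions are ignored.
        return mean(v for v in values if isinstance(v, (int, float)))
    if arbitrator == "LIST":
        return values
    if arbitrator == "ALIST":
        return list(opinions)
    if isinstance(arbitrator, tuple) and arbitrator[0] == "PROCESS":
        return dict(opinions)[arbitrator[1]]
    if callable(arbitrator):
        # A user procedure receives what ALIST would have returned.
        return arbitrator(list(opinions))
    raise ValueError(f"unknown arbitrator: {arbitrator!r}")

def priority(alist):
    """Prefer Process-1, then Process-2, then the first opinion listed."""
    by_process = dict(alist)
    for name in ("Process-1", "Process-2"):
        if name in by_process:
            return by_process[name]
    return alist[0][1]

camera_clearance = retrieve_slot_opinion(height_opinions, "MAX")  # view over the fence
cow_containment = retrieve_slot_opinion(height_opinions, "MIN")   # keeping cows confined
preferred = retrieve_slot_opinion(height_opinions, priority)
```

The same stored opinions yield different answers for the two tasks, which is the point of deferring integration to retrieval time.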

It is also possible to retrieve an entire data token, given a token
<id>.

(RETRIEVE-DATA-TOKEN <id>
                     &optional <slot-process-alist>)

Given a data token <id>, this function returns a flavor instance
whose slots are filled with values as determined by the optional
argument. By default, each slot receives the most recent opinion
expressed by any process (i.e., LATEST). To override the default,
a pair of the form (<slot> . <arbitrator>) is included on the
<slot-process-alist>.
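
Whole-token retrieval with per-slot overrides can be sketched as below. The sketch returns a plain dictionary rather than a flavor instance, models arbitrators as Python callables, and assumes each slot's opinions are stored oldest first; all names and values are illustrative.

```python
def latest(opinions):
    """Default arbitrator: the most recently provided opinion."""
    return opinions[-1][1]

def largest(opinions):
    """Arbitrator corresponding to MAX."""
    return max(value for _, value in opinions)

def retrieve_data_token(token, slot_arbitrators=None):
    """Fill every slot using its override if one is given, else `latest`."""
    chosen = slot_arbitrators or {}
    return {slot: chosen.get(slot, latest)(opinions)
            for slot, opinions in token.items()}

# Each slot holds (process name, value) pairs, oldest first.
token01 = {
    "HEIGHT": [("stereo", 7.3), ("laser", 6.9)],
    "SEMANTIC-DESCRIPTION": [("stereo", ("LARGE", "RED", "POST"))],
}
default_view = retrieve_data_token(token01)
tallest_view = retrieve_data_token(token01, {"HEIGHT": largest})
```

Only the HEIGHT slot changes between the two views; slots without an override continue to use the default.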

5  Summary


We  have  described  an  information  storage  and  retrieval  mech¬ 
anism  that  avoids  the  requirement  of  consistency  maintenance 
imposed by traditional knowledge-based systems. By viewing
data  as  opinions  rather  than  facts,  the  system  is  able  to  combine 
knowledge  that  is  generally  accepted  as  being  true  with  data  that 
may  be  from  unreliable  sources.  We  have  also  given  a  formal  ac¬ 
count  of  the  semantics  of  such  an  approach. 

The  design  differs  significantly  from  other  data  management 
systems  by  allowing  information  to  be  integrated  at  retrieval  time 
rather  than  requiring  all  data  to  be  made  consistent  at  insertion 
time.  While  this  approach  requires  storage  of  a  larger  volume  of 
data  than  would  be  required  by  other  data  management  systems, 
it  has  several  significant  advantages: 

•  It  affords  the  opportunity  to  integrate  information  according 
to  the  demands  of  the  current  task. 

•  It  allows  the  use  of  information  that  may  not  be  available  at 
insertion  time. 

•  It  eliminates  the  need  for  fusing  information  that  is  irrelevant 
to  the  ongoing  task. 

The  information  management  system  described  has  been  im¬ 
plemented  and  incorporated  within  the  Core  Knowledge  System. 
It  has  been  used  to  store  information  derived  from  image  pro¬ 
cessing  along  with  opinions  from  other  sources  and  from  users.  It 
appears  to  be  well  suited  for  the  requirements  of  an  autonomous 
robot, and for information storage in general.

A  Syntax for the CKS Query Language

<transaction> ::= (FIND-IDS <query>) |
                  (FIND-CLASSES <query>) |
                  (INSERT-DATA-TOKEN <data-token>) |
                  (RETRIEVE-DATA-TOKEN <id> <slot-process-alist>) |
                  (INSERT-SLOT-OPINION <id> <slot> <value>) |
                  (RETRIEVE-SLOT-OPINION <id> <slot> <arbitrator>)

<query> ::= <simple-query> | <compound-query>

<simple-query> ::= (<qualifier> IN <spatial-description>) |
                   (<qualifier> IS <semantic-description>) |
                   (<qualifier> HAS-AS-PART <semantic-description>) |
                   (<qualifier> IS-PART-OF <semantic-description>) |
                   (<qualifier> <3-valued-relation> [<args> ...])

<compound-query> ::= (OR <query> ... <query>) |
                     (AND <query> ... <query>) |
                     (AND-NOT <query> <query>)

<qualifier> ::= APPARENTLY | POSSIBLY

<semantic-description> ::= (<vocabulary-word> ... <vocabulary-word>)

<arbitrator> ::= DEFAULT | LATEST |
                 MIN | MAX | AVG |
                 (PROCESS <process-name>) |
                 LIST | ALIST |
                 <procedure>

ABSTRACT

In  this  paper  we  review  the  development  of  IU  envi¬ 
ronments  and  their  significant  features.  We  begin  with  a 
history  of  machine  vision  environments.  We  then  detail 
the  basic  components  of  general  IU  environments  with  re¬ 
spect  to  representations,  programming  constructs,  system- 
specific  data  bases  and  user  interfaces.  We  look  at  how 
these  components  have  been  realized  in  several  different 
systems  through  examples  and  conclude  with  some  possi¬ 
ble  developments  in  the  near  future. 

1.  INTRODUCTION 

There  is  an  enormous  diversity  of  work  in  image  un¬ 
derstanding,  ranging  from  the  development  of  algorithms 
to  building  integrated  systems;  from  research  dealing  with 
fundamental  questions  to  specific  applications  with  de¬ 
manding  requirements.  Over  the  past  25  years,  several  dif¬ 
ferent  software  environments  have  been  developed  to  sup¬ 
port  and  integrate  IU  activity.  These  environments  were 
initially based on simple constructs for programming specific
machines and maintaining reusable libraries of subroutines
common to vision research and applications. They
have  now  evolved  into  integrated  systems  of  considerable 
computational  and  representational  power,  reflecting  the 
range  of  problems  researchers  in  computer  vision  deal  with 
and  incorporating  much  of  what  has  been  learned  about 
machine vision. These systems are referred to as Image
Understanding Environments or IU Environments.
In this paper we review several of them and abstract their
basic functional components in order to identify underlying
principles of their construction and to indicate future
developments. 

A key attribute of IU environments is that they abstract
the types of objects and underlying computational
geometries used in machine vision into useful programming
constructs. This allows programming in terms of a virtual
architecture reflecting these representations and spatial
organizations. It is possible to develop algorithms as though
there are individual processors associated with each pixel
or each extracted image object in diverse neighborhood
topologies. Abstraction is not only useful for expressing
algorithms concisely and developing them rapidly but also
supports machine-independent code development and
integration. Abstraction is becoming increasingly important
as  more  vision  machine  architectures  are  built.  The  im¬ 
portance  of  rule-based  processing  over  spatially  indexed 
data  bases  of  objects,  such  as  junctions,  curves,  regions, 
surfaces,  volumes  and  perceptual  groups,  in  addition  to 
images,  has  become  clear  in  the  past  decade.  To  repre¬ 
sent  such  objects,  image  understanding  environments  have 
been  built  on  top  of  powerful  object-oriented  programming 
environments  which  support  abstract  type  definition,  com¬ 
bination  and  inheritance  mechanisms.  Programming  con¬ 
structs  in  IU  environments  also  reflect  the  variety  of  spatial 
organizations  developed  in  vision  research,  which  include 
local,  iterative  processes,  pyramid  and  multi-resolution 
grid  organizations  and  generalized  Hough  transforms. 

IU  environments  also  supply  the  necessary  tools  for 
organizing  the  complexity  of  vision  research  and  building 
vision systems. In a typical development, a researcher may
work with hundreds of related images, millions of extracted
image objects, code written by several different people, and
different  software  packages  ranging  from  numerical  routine 
libraries  to  expert-system  shells.  Some  environments  now 
provide  automatic,  intelligent  data  base  mechanisms  for 
keeping  track  of  processing  history,  results  and  relevant 
code. There are also rich user interfaces which support
rapid access to these data bases and for displaying objects
and processing status. Many of these are highly interactive
and allow great flexibility in controlling how information
is presented or how an object is displayed.

An  important  aspect  of  vision  work  is  that  it  almost 
always  involves  several  people.  IU  environments  directly 
affect the culture of such groups of researchers and engineers
by enhancing communication through a common
representational framework and constructs which support
the sharing of code and data. They allow the rapid prototyping
and testing of ideas and significantly enhance the learning
of novice users.

2  HISTORY 

Image processing software has evolved from machine-specific
software packages used for interactive image analysis
to machine-independent tools and representations for
building  autonomous  vision  systems.  Several  trends  have 
been  a  part  of  this  evolution.  The  power  of  hardware  and 
systems  software  has  increased  to  the  point  where  a  su¬ 
percomputer  of  a  few  years  ago  will  soon  sit  on  someone’s 
desk.  The  representations  and  operations  used  in  vision 
work  have  expanded  from  images  and  local  pixel  oper¬ 
ations  to  a  larger  and  more  abstract  set  of  objects  and 
computational  geometries  for  such  constructs  as  pyramids, 
curves,  regions,  surfaces  and  volumes.  In  some  systems 
this  has  been  achieved  through  the  use  of  powerful  object- 
oriented  programming  environments.  In  addition,  environ¬ 
ments  have  integrated  techniques  and  representations  such 
as  rule-based  inference  systems,  hypothesis  management 
schemes,  and  blackboards.  User  interfaces  have  changed 
from  simple  image  displays  and  inflexible  batch-oriented 
command  processors  to  multi-window  displays  that  allow 
the  interactive  manipulation  of  graphic  objects  and  text. 
Databases  have  been  added  to  keep  track  of  objects,  pro¬ 
cessing  history  and  software.  Current  systems  are  now 
more  open  and  extendable.  They  can  remain  viable  and 
take  advantage  of  new  representations,  new  techniques  and 
new  hardware  within  the  same  framework. 

Early  systems  designed  to  support  human  analysis  of 
images include PAX [Joh70] and VICAR [Cas79]. A useful
overview of several early systems can be found in Preston's
review article [Pre81]. These systems generally assisted
human interpretation of aerial and satellite images
for  strategic  and  environmental  purposes.  There  was  a 
heavy  emphasis  on  interactive  noise  removal  and  applica¬ 
tion  of  local  image  operations,  such  as  thresholding,  con¬ 
trast  enhancement  and  edge  extraction,  to  increase  the  vis¬ 
ibility  of  features  to  human  operators.  Nearly  all  consisted 
of  a  subroutine  library  written  in  FORTRAN  and  assembly 
language.  The  libraries  included  utilities  for  the  display  of 
and  conversion  between  different  formats,  geometric  oper¬ 
ations  such  as  scaling  and  rotation,  image  transforms  such 
as  noise  removal  and  Fourier  analysis,  arithmetic  operators 
for  matrices  and  complex  numbers,  image  measurement 
subroutines  for  computing  histograms,  and  basic  pattern 
classification  routines.  These  libraries  were  usually  used 
through  an  interactive  command  processor  which  consisted 
of  a  top  level  that  allowed  the  names  of  library  subroutines 
to be typed together with their arguments. Some of these
command  processors  grew  into  a  more  general  interactive 
interface  to  the  operating  system  as  a  whole. 

The  development  of  software  packages  and  environ¬ 
ments  to  support  research  in  building  autonomous  vision 
systems  began  in  the  late  1960s  and  is  continuing  to¬ 
day.  An  exact  chronology  is  difficult  because  most  envi¬ 
ronments  have  been  through  many  stages  of  development, 
but  there  are  clear  innovations  associated  with  several  of 
them. One of the earliest was associated with the Hand-eye
system developed at Stanford [I.JC71. IFIGS]. The
UMIPS  [DKPR83]  system  from  the  University  of  Mary¬ 
land  consisted  of  a  standard  library  of  subroutines,  but 
it also introduced programming constructs for specifying
neighborhood operations over images. The basic programming
constructs were efficiently implemented and could be
combined  to  yield  new  and  combinable  neighborhood  op¬ 
erators.  UMIPS  leveraged  strongly  off  the  facilities  pro¬ 
vided  by  the  UNIX  operating  system  and  its  philosophy  of 
building  small  tools  that  can  then  be  combined  through  a 
command  language.  HIPS  [LCS84]  is  another  system  that 
made  heavy  use  of  these  same  abilities  in  UNIX.  Both 
UMIPS  and  HIPS  associate  processing  histories  with  im¬ 
ages  for  a  primitive  results  database. 

The VISIONS OPERATING SYSTEM [Ke83] from
the  University  of  Massachusetts  at  Amherst  was  developed 
in  the  mid  1970's  and  continues  to  support  the  ongoing 
development  of  the  VISIONS  image  understanding  system 
[HR87]. It has since been through several expansions. It
was  one  of  the  early  systems  to  integrate  the  numerical 
and  symbolic  parts  of  vision  research  by  using  FORTRAN 
for  numerically  intensive  image  operations,  and  by  using 
LISP  for  symbolic  processing  necessary  for  higher  level  vi¬ 
sion.  It  contained  several  programming  constructs  for  ex¬ 
pressing  neighborhood  operations  in  registered  images  and 
pyramids  and  for  iterating  these  local  operations  in  several 
different  ways.  Databases  for  maintaining  the  status  and 
relationship  between  objects  in  the  current  processing  en¬ 
vironment  and  for  long  term  results  were  also  provided. 
Cooperative  code  development  among  a  large  number  of 
graduate students was supported by semi-automatic mechanisms
for generating help files that described routines developed
by users and added to the system. In this way,
a  growing  library  of  software  was  developed  by  a  large 
set  of  researchers.  Recently,  the  VISIONS  OPERATING 
SYSTEM  has  been  used  as  the  basis  of  a  commercially 
developed IU environment called KBVision [KBV87].

The  SRI  Image  Understanding  Testbed  was  developed 
in  the  early  1980s  as  a  system  of  hardware  and  software 
to facilitate the integration, testing and evaluation of research
by application-oriented groups using results from
the DARPA IU Community. It combined the low level
tools  and  high  level  modules  developed  by  several  mem¬ 
bers  of  this  community.  The  IU  testbed  was  based  upon 
a  common  set  of  hardware  (a  VAX  780  running  Unix  with 
a  Grinnell  display  device)  to  be  located  at  each  site.  The 
low level tools supplied by various groups included C-based
definitions  for  a  command  processor,  displays,  routines, 
images,  and  routines  for  manipulating  strings,  lists,  vectors 
and  matrices.  The  defined  objects  and  displays  amounted 
to  an  effort  to  implement  object-oriented  constructs  in  a 
conventional  language  by  defining  centralized  routines  that 
could  select  an  appropriate  routine  to  deal  with  the  specific 
qualities of an object passed to it. For example, there were
functions for opening or closing images that would work
on any of the several supported image formats. This did
not, however, support the clean expression of inheritance.
The high level modules included programs for three-dimensional
model-based vision, determining stereo camera
transforms, stereo correlation, generalized Hough transforms,
finding lines, recursive region segmentation and relaxation.
These high level modules were written in different
languages and most of them did not make use of the
lower level tools. The Testbed probably did not become an
evolving standard because it did not offer enough support
for  writing  new  programs  and  representations,  depended 
on  standardized  hardware,  and  had  a  limited  interactive 
interface. 

Image Calc [Qua84] was developed at SRI during the
same period as the SRI Testbed and introduced several
significant advances. It is based on the Symbolics LISP
Machine  which  is  an  interactive,  single  user  workstation 
with  an  exceptionally  powerful  environment.  The  entire 
system  is  written  in  ZetaLISP  and  the  object-oriented 
programming  system  Flavors.  The  uniformity  provided 
by  using  LISP  at  all  levels  of  vision  programming,  both 
pixel  and  object,  supports  the  development  of  abstract 
objects and the integration of different levels in vision research.
The LISP development environment makes rapid,
focused  testing  and  debugging  possible  while  supporting 
compilation  for  efficiency.  Image  Calc  provides  a  database 
caching  mechanism  that  allows  high  level  procedures  to 
share computationally expensive operations without paying
the penalty of doing the same operation multiple times.
The user interface provides many different ways of viewing
and interacting with images, and selecting operations
through the use of menus.
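
The caching idea can be sketched in a few lines of Python. This is only an illustration, not Image Calc code: the names cached_operation, _results and smooth are hypothetical, and a dictionary stands in for the results database.

```python
import functools

_results = {}    # hypothetical stand-in for the results database
call_count = 0   # counts how often the expensive body really runs


def cached_operation(func):
    """Store each result under the key (operation name, arguments)."""
    @functools.wraps(func)
    def wrapper(*args):
        key = (func.__name__, args)
        if key not in _results:
            _results[key] = func(*args)
        return _results[key]
    return wrapper


@cached_operation
def smooth(row):
    """Moving average over each pixel and its immediate neighbours."""
    global call_count
    call_count += 1
    smoothed = []
    for i in range(len(row)):
        window = row[max(i - 1, 0):i + 2]
        smoothed.append(sum(window) / len(window))
    return tuple(smoothed)


first = smooth((3, 6, 9))
second = smooth((3, 6, 9))   # served from the cache, body is not re-run
```

Two high level procedures that both ask for smooth on the same row share one computation, which is the behaviour the paragraph above attributes to the caching mechanism.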

Powervision [RM88], developed at Advanced Decision
Systems, is also based on the Symbolics LISP machine.
Standard abstract definitions were defined for several types
of  objects  beyond  images  including  curves,  regions  and 
groups. It introduced the idea of virtual images which
are images whose contents are defined by a functional closure
[SJ84] that describes how to transform the values in
another image. The image is virtual in the sense that no
actual storage is required and the functional closure can be
arbitrarily  nested.  Powervision  defined  browsers  for  inter¬ 
acting  with  databases  for  objects,  processing  histories  and 
results. Powervision has been extended in a more recent
system named View [MKB+87]. View's main innovations
are the use of Common LISP [SJ84] and CLOS [BDG+87]
to define a machine independent environment with object-oriented
programming and explicit support for multiple
representations of the same abstract object. It associates
basic programming constructs, such as access and iteration,
with image understanding objects, rather than with
the programming language, so that these operations can
be easily extended to new representations. The notion of
virtual images has been extended to a larger class of
objects.
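
A virtual image defined by a functional closure can be illustrated with a short Python sketch. The class and method names here (ArrayImage, VirtualImage, get) are assumptions made for the example and are not taken from Powervision or View.

```python
class ArrayImage:
    """An image backed by stored pixel rows."""

    def __init__(self, rows):
        self.rows = rows

    def get(self, x, y):
        return self.rows[y][x]


class VirtualImage:
    """An image whose pixels are computed on demand from another image."""

    def __init__(self, source, transform):
        self.source = source
        self.transform = transform   # closure applied to each source value

    def get(self, x, y):
        # No pixel storage: the value is derived at access time.
        return self.transform(self.source.get(x, y))


base = ArrayImage([[10, 20], [30, 40]])
doubled = VirtualImage(base, lambda v: 2 * v)
# Closures nest arbitrarily: this image is defined on top of another virtual image.
thresholded = VirtualImage(doubled, lambda v: 1 if v > 50 else 0)
```

Because VirtualImage offers the same get interface as ArrayImage, code that reads pixels does not need to know which kind of image it was handed.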

An emphasis in recent environments is to integrate
image understanding environments with more general AI
tools. Significant examples are the SRI Core Knowledge
System [...], interactions between the VISIONS OPERATING
SYSTEM and the UMASS Generic Blackboard [...],
[...] and the CMU [...]. These are blackboard-based [...]
modules into a working system.

There have been several IU environments developed
for  specific  hardware  architectures.  These  can  be  very 
roughly  grouped  into  commercial  systems  and  research 
prototypes. Some commercial systems, e.g. those developed
by Bausch & Lomb, Leitz, Joyce-Loebl and others
in the 1970s, were most easily accessible to a user at the
command or menu level [Pre81]. They took advantage
of  special  purpose  hardware  to  speed  up  the  interactive 
processing  of  images  greatly.  Dedicated  interface  devices, 
such as keyboards with predefined keys for basic operations
and trackballs, were also included. Later commercial
systems, such as the Environmental Research Institute's
Cytocomputer, the PIXAR and the WARP [HWW87],
included subroutine libraries and custom languages for
writing new routines for the hardware. Some of these
custom languages are MORPHAL [Lan81], PPL [Gud79],
C3PL [Pre81] and Apply [HWW87]. Programming environments
associated with commercial systems tend to be
tied strongly to the underlying hardware, but some do provide
an interface to conventional programming languages
like C and FORTRAN. The environments associated with
research machines are much more open-ended. Programming
languages are a necessary level of interface to them.
There is a tight coupling between these experimental vision
architectures and the programming constructs in image
understanding environments for vision research: the
machines embody the programming constructs directly.
The late 1970's also saw the development of several languages
oriented toward supporting image understanding
on parallel SIMD and MIMD machines. Some of these languages
were PascalPL [Uhr81] and MAC [Dou81]. These
languages were designed to be implemented on both sequential
and parallel machines.

2.1. IU ENVIRONMENT ARCHITECTURE

Current IU software environments have very similar
functional components, arranged in different topologies,
and referred to by different names. Common to all of them
are object types used in image understanding, such as junctions,
images, histograms, and programming constructs for
manipulating them. Several environments are built on top
of rich programming environments which support the creation
of new types of objects. The environments also include
several different types of databases for keeping track
of such things as object instances created during an application,
permanent results, and code developed by multiple
users. There are highly interactive interfaces to these
different databases, for accessing objects and displaying
them. As an example, Figure 1 shows the architecture of
the Powervision environment.

3 IMAGE UNDERSTANDING OBJECTS
AND PROGRAMMING CONSTRUCTS

Figure 1: Powervision Environment

The semantics of several IU objects and operations,
especially those dealing with spatially indexed objects and
[...] terms of mathematical mappings, functional composition,
and relations (see work on image algebras [RW87]). Other
objects, such as databases and inference processes, are
more open-ended and tend to be complete modules instead
of being directly expressed in terms of defined programming
constructs. There is an important trend of building
IU objects and programming constructs on top of general
object-oriented programming environments [Cox84] as opposed
to developing complete IU programming languages
from scratch. This approach, combined with the semantics
of functional mappings, enables us to begin with a
very general object type corresponding to a mathematical
mapping, and refine it into a class of geometric objects
common in IU. These are mappings from a set of dimensional
indices, which can be discrete or continuous, into a
set of values. The power of this abstraction is that it allows
object operations and combinations to be expressed with
considerable generality. For example, convolution can be
defined as a general operation that works on any spatially
registered objects with different dimensions and representations.
The ability to combine objects makes the definition
of new types of objects easier.
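
As a rough illustration of treating geometric objects as mappings from indices to values, the following Python sketch uses hypothetical names (Mapping, combine) that do not come from any of the environments reviewed here. It shows a discrete image, a continuously defined ramp and a one dimensional signal behind one interface, and a pointwise combination of two registered objects.

```python
class Mapping:
    """A geometric object viewed as a mapping from indices to values."""

    def __init__(self, func):
        self.func = func

    def __call__(self, *index):
        return self.func(*index)


def combine(op, a, b):
    """Pointwise combination of two spatially registered mappings."""
    return Mapping(lambda *index: op(a(*index), b(*index)))


pixels = [[1, 2, 3], [4, 5, 6]]
image = Mapping(lambda r, c: pixels[r][c])        # discrete 2-D domain
ramp = Mapping(lambda r, c: 0.5 * r + 0.25 * c)   # defined for continuous indices
signal = Mapping(lambda t: t * t)                 # 1-D domain

# The combined object is itself a mapping and can be combined again.
total = combine(lambda u, v: u + v, image, ramp)
```

The result of combine is a new object of the same general type, which is one way to read the remark that combining objects makes new object types easier to define.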

There are four key parts to object-oriented programming:
encapsulation, generic functions, late binding and
inheritance. Encapsulation refers to techniques for hiding
the actual implementation of representations behind
an abstract functional interface. Generic functions correspond
to this abstract functional interface. A generic
function calls the most specific function definition associated
with the class type of its arguments. The use of
generic functions and encapsulation lets changes be made
to a representation without changing the procedures that
manipulate that representation. In this way, procedures
use the same interface for all objects that are of the same
type even though they may have very different internal
representations. Late binding means that a generic function
call is bound to an actual procedure only when it
is called, not when it was written by the programmer or
compiled into a system. This allows a generic function call
to run different procedures when it is called with different
classes of objects. Since this happens at run time, the
same procedure can be called on objects of two different
representations of the same abstract type and will work
correctly with both.
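
Generic functions and late binding can be sketched with Python's functools.singledispatch, which selects an implementation from the class of the first argument at call time. The two curve representations and the function names below are hypothetical and serve only to illustrate the idea.

```python
from functools import singledispatch


class PointListCurve:
    """Curve stored as an explicit list of (x, y) points."""

    def __init__(self, points):
        self.points = list(points)


class ChainCodeCurve:
    """Curve stored as a start point plus a list of (dx, dy) steps."""

    def __init__(self, start, steps):
        self.start = start
        self.steps = list(steps)


@singledispatch
def curve_points(curve):
    """Generic function: the abstract interface to any curve."""
    raise TypeError(f"no curve_points method for {type(curve).__name__}")


@curve_points.register(PointListCurve)
def _(curve):
    return list(curve.points)


@curve_points.register(ChainCodeCurve)
def _(curve):
    x, y = curve.start
    points = [(x, y)]
    for dx, dy in curve.steps:
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def point_count(curve):
    """Written once; the procedure that runs is chosen when it is called."""
    return len(curve_points(curve))
```

point_count never inspects the representation it is given, so a third curve representation can be added by registering one more method, without touching point_count.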

Inheritance is very important for organizing a large
object-oriented system. When new classes of objects are
defined, they inherit operations from previously defined
classes. Inheritance also provides a basis for the incremental
development of a system while maintaining the old
abilities for other representations. There is frequently a
rich array of tools in an object-oriented programming environment
for finding out about objects, the relationships
between them and the operations defined on them. This
organization and these tools significantly reduce the complexity
of extending a system. Essentially, a new object
is defined by saying that "this object is just like these
other objects, except for ...", without knowing all of the
details of the implementations of the other objects. Code
is organized into large related chunks, while allowing the
ability to reuse and extend pieces without affecting the
old definitions. Classes have a few general kinds of generic
functions associated with them: definitions inherited from
other classes, new definitions that are unique to a class,
and definitions that shadow an inherited definition by
modifying or replacing it. When a generic function is
called on an object in a class, the new or shadow definitions
take precedence over any inherited definitions.
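
The three kinds of definitions (inherited, new, and shadowing) can be shown in a small Python class hierarchy. The class names and methods are hypothetical and chosen only for the example.

```python
import math


class Geometric:
    def domain_dimensions(self):
        return 0

    def describe(self):
        return f"{type(self).__name__} with {self.domain_dimensions()}-D domain"


class Curve(Geometric):
    def __init__(self, points):
        self.points = points

    def domain_dimensions(self):   # shadows Geometric's definition by replacing it
        return 1

    def length(self):              # new definition, unique to curves
        return sum(math.dist(p, q)
                   for p, q in zip(self.points, self.points[1:]))


class ClosedCurve(Curve):
    def length(self):              # shadows Curve.length by modifying it
        closing = math.dist(self.points[-1], self.points[0])
        return super().length() + closing


open_curve = Curve([(0, 0), (3, 0), (3, 4)])
closed_curve = ClosedCurve([(0, 0), (3, 0), (3, 4)])
```

ClosedCurve is "just like Curve, except for" its length, and it still answers describe through the definition inherited from Geometric.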

There are three general classes of primitive generic
functions associated with common geometric objects used
in IU Environments: 1) Accessors, 2) Boundary checking,
and 3) Iteration. More complex operations are expressed
in terms of these primitive generic functions. By using the
primitive generic functions, the same definition can work
with objects that have very different internal representations.
Definitions can still be associated with a particular
representation, but this is not generally necessary since
most of the processing happens in the access and iteration
constructs that are already associated with particular
representations. Some operations that can be defined in
terms of these primitives include display routines as shown
in Figure 7, and convolution as shown in Figure 5.

An accessor is a generic function used to get or set a
property of a geometric object. Some accessors deal with
attributes of an object such as non-spatial summary properties
like length or average contrast for a curve. It is natural
and convenient to implement at least two other general
types of accessors. Given a domain index, the first type
returns a spatial location; the second type returns other
values associated with that location. As an example, these
two accessors for a curve extracted from an image map a
one dimensional domain index into the two dimensional
image  space  that  the  curve  lies  in  and  into  the  pixel  value 
found  at  that  location.  Geometric  objects  can  be  embed¬ 
ded  in  each  other  to  define  neighborhoods  over  which  a 
set  of  values  are  accessed.  This  is  useful  when  functions 
are  applied  over  a  local  area  with  respect  to  some  position. 
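
The two kinds of accessor for a curve extracted from an image might look as follows in Python. ImageCurve and its method names are assumptions made for this sketch, not names from any of the systems discussed.

```python
class ImageCurve:
    """A curve embedded in an image, indexed by position along the curve."""

    def __init__(self, image, locations):
        self.image = image            # rows of pixel values
        self.locations = locations    # (row, col) for each domain index

    def location(self, i):
        """First kind of accessor: domain index -> spatial location."""
        return self.locations[i]

    def value(self, i):
        """Second kind of accessor: domain index -> pixel value found there."""
        row, col = self.location(i)
        return self.image[row][col]

    def length(self):
        """Non-spatial summary attribute of the curve."""
        return len(self.locations)


image = [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]
diagonal = ImageCurve(image, [(0, 0), (1, 1), (2, 2)])
```

Here the one dimensional index i is mapped first into the two dimensional image space and then into a pixel value, as in the example given in the text.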

A  basic  problem  with  accessors  is  handling  indices 
that  are  outside  the  domain  of  a  finite  geometric  object. 
Boundary  checking  operations,  which  compare  a  set  of  in¬ 
dices  against  the  domain  of  an  object,  have  several  options 
when  the  indices  are  outside  an  object’s  domain.  Options 
include  signalling  an  error,  returning  a  default  value,  or 
executing  a  procedure  to  calculate  a  new  value.  When 
geometric  objects  are  combined,  boundary  checking  can 
occur  for  the  domain  of  each  object.  For  example,  an  ac¬ 
cessor  for  a  region  object  embedded  in  an  image  can  access 
values using its specified boundary operations or from the
domain  of  the  image. 
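
The three options for out-of-domain indices can be sketched as interchangeable policies passed to an accessor. All names below are hypothetical; the reflecting policy is one possible "procedure to calculate a new value" and assumes the image has at least two rows and columns and that the index overshoots by less than one image size.

```python
def make_accessor(rows, on_outside):
    """Return a pixel accessor that defers to on_outside beyond the domain."""
    height, width = len(rows), len(rows[0])

    def get(r, c):
        if 0 <= r < height and 0 <= c < width:
            return rows[r][c]
        return on_outside(rows, r, c)

    return get


def signal_error(rows, r, c):
    raise IndexError(f"index ({r}, {c}) is outside the image")


def default_zero(rows, r, c):
    return 0


def reflect(rows, r, c):
    """Mirror an outside index back into the image about its border pixel."""
    def mirror(i, n):
        i = -i if i < 0 else i
        return 2 * (n - 1) - i if i >= n else i

    return rows[mirror(r, len(rows))][mirror(c, len(rows[0]))]


rows = [[1, 2, 3],
        [4, 5, 6]]
strict = make_accessor(rows, signal_error)
padded = make_accessor(rows, default_zero)
mirrored = make_accessor(rows, reflect)
```

Because the policy is a parameter of the accessor rather than of the algorithm, the same neighborhood operation can run with any of the three behaviours.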

Another  basic  operation  for  geometric  objects  is  to 
iterate  an  accessor  over  the  object’s  domain.  There  are 
several  ways  to  control  iteration.  If  no  particular  ordering 
is  required,  the  iteration  can  be  done  in  parallel.  To  control 
the  order  of  values  in  a  sequential  operation,  the  order 
that  dimensions  and  index  positions  are  stepped  through 
must be specified. These correspond to the mapping and
walking constructs found in View.
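
The difference between unordered and ordered iteration can be illustrated in Python. The functions map_values and walk_values are hypothetical names for this sketch; they are not the View constructs themselves, whose signatures are not given in the text.

```python
def map_values(func, rows):
    """Apply func to every value; no visiting order is promised, so the
    work could be distributed. The result is registered with the input."""
    return [[func(v) for v in row] for row in rows]


def walk_values(func, rows, order="row-major"):
    """Visit every value sequentially in an explicitly specified order."""
    height, width = len(rows), len(rows[0])
    if order == "row-major":
        indices = [(r, c) for r in range(height) for c in range(width)]
    elif order == "column-major":
        indices = [(r, c) for c in range(width) for r in range(height)]
    else:
        raise ValueError(f"unknown order: {order}")
    for r, c in indices:
        func(rows[r][c])


rows = [[1, 2],
        [3, 4]]
seen = []
walk_values(seen.append, rows, order="column-major")
scaled = map_values(lambda v: v * 10, rows)
```

An operation whose result depends on visiting order, such as collecting values into a list, needs the walking form; a pointwise operation such as scaling needs only the mapping form.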

Figures  2  through  5  show  four  different  definitions  of 
convolution  in  the  UMIPS,  Image  Calc,  Powervision  and 
View  environments.  They  show  how  basic  image  process¬ 
ing  operations  have  been  expressed  in  different  environ¬ 
ments.  UMIPS  is  representative  of  environments  that  use 
conventional  programming  languages  such  as  C  and  FOR¬ 
TRAN  that  lack  object-oriented  programming  constructs 
for  the  definition  of  operations.  Operations  are  defined 
in  terms  of  subroutine  calls,  and  all  access  and  iteration 
is  done  through  the  constructs  of  the  language  or  a  ba¬ 
sic  neighborhood  operator.  Image  Calc,  Powervision  and 
View  are  representative  of  more  recent  environments  that 
are  designed  for  powerful  single  user  work  stations.  They 
are  all  written  in  LISP  and  use  object-oriented  program¬ 
ming. 

Figure  2  shows  an  example  from  the  UMIPS  environ¬ 
ment.  In  that  environment,  programs  are  defined  as  small 
pieces  that  are  put  together  by  the  use  of  Unix  pipes.  Pro¬ 
grams  are  usually  defined  using  the  programming  language 
C.  MAGIC  is  a  facility  for  defining  image  operators  over 
the  pixels  in  a  rectangular  neighborhood.  It  takes  care 
of  applying  the  operator  over  an  image  and  saving  results 
into  a  registered  output  image.  Boundary  conditions  are 
handled  by  insuring  that  neighborhoods  will  never  refer¬ 
ence  outside  of  an  image.  Start  sets  up  a  neighborhood 
procedure  by  defining  the  size  and  center  of  the  neigh¬ 
borhood  and  by  setting  up  the  input  and  output  images. 
Local  defines  the  actual  neighborhood  operation.  Local 
is  called  once  for  each  neighborhood  that  fits  within  the 
constraints  of  an  image.  This  leaves  a  dead  zone  around 
the boundaries in the output image where no values are
stored. The result of the operator is automatically stored
at the same position in the output image.

    /* Convolution in UMIPS MAGIC */
    #include "stdio.h"
    #include "cvl.h"
    #define O_NINS 1
    #define O_NOUTS 1
    #include "magic.h"

    #define HEIGHT 5
    #define WIDTH 5
    float kernel[HEIGHT][WIDTH] = {
        {0.0025, 0.0125, 0.0200, 0.0125, 0.0025},
        {0.0125, 0.0625, 0.1000, 0.0625, 0.0125},
        {0.0200, 0.1000, 0.1600, 0.1000, 0.0200},
        {0.0125, 0.0625, 0.1000, 0.0625, 0.0125},
        {0.0025, 0.0125, 0.0200, 0.0125, 0.0025}
    };

    /* Magic start up */
    startup(argc, argv)
    int argc;
    char *argv[];
    {
        o_height = HEIGHT;        /* Neighborhood height and width */
        o_width = WIDTH;
        o_y_cntr = o_height / 2;  /* Neighborhood center */
        o_x_cntr = o_width / 2;
        inpicf(argv[1]);
        outpicf("", 8);
    }

    /* Magic local operator */
    local(nbd, result)
    short **nbd[O_NINS],      /* Local neighborhood */
          result[O_NOUTS];    /* Result of calculation */
    {
        int x, y;
        float sum = 0.0;

        for (y = 0; y < o_height; y++) {
            for (x = 0; x < o_width; x++) {
                sum += nbd[0][y][x] * kernel[y][x];
            }
        }
        result[0] = sum;
    }

Figure 2: Convolution in UMIPS MAGIC
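
The dead zone left by this style of neighborhood processing can be reproduced in a short Python sketch. This is not MAGIC itself: convolve_valid and its fill argument are hypothetical, and the loop simply mirrors the sum of products in the local operator of Figure 2, applied only where the whole neighborhood fits inside the image.

```python
def convolve_valid(image, kernel, fill=None):
    """Apply kernel at every position where it fits entirely in the image.

    Output positions whose neighborhood would leave the image keep the
    fill value, forming a dead zone around the border.
    """
    height, width = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    y_cntr, x_cntr = k_h // 2, k_w // 2
    output = [[fill] * width for _ in range(height)]
    for top in range(height - k_h + 1):
        for left in range(width - k_w + 1):
            total = 0.0
            for y in range(k_h):
                for x in range(k_w):
                    total += image[top + y][left + x] * kernel[y][x]
            # Store the result at the neighborhood center, so the output
            # stays registered with the input.
            output[top + y_cntr][left + x_cntr] = total
    return output


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
box = [[1.0] * 3 for _ in range(3)]
result = convolve_valid(image, box)
```

With a 3 by 3 kernel on a 3 by 4 image only two output positions receive values; every other position stays at the fill value.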

Figure  3  shows  the  same  algorithm  programmed  in 
Image Calc. Image Calc defines images using the object-oriented
programming language Flavors [Sym86]. Generic
functions  associated  with  images  include  geometric  and 
non-geometric  accessors,  and  operations  for  display  and  file 
operations.  Defun-cached  defines  a  function  that  caches 
its  results  in  a  processing  relations  database  as  described  in 
Section  4.  Image  Calc  has  a  virtual  image  representation 
that  allows  low  cost  coordinate  transformations  as  long 
as  the  mapping  is  separable  into  the  composition  of  an 
x  mapping  and  a  y  mapping.  Make-bordered-image 
defines  an  image  that  transforms  border  image  coordinates 
by replication or wrap around before returning the pixel
found in a normal image. With-image-elements is an
expansion environment that causes the iref inside of the
form to be expanded into more efficient code.

    (defun-cached convolve-image (image kernel)
      (let* ((x-dim (image-x-dim image))
             (y-dim (image-y-dim image))
             (kernel-width (array-dimension kernel 0))
             (kernel-height (array-dimension kernel 1))
             (kernel-half-width (floor kernel-width 2))
             (kernel-half-height (floor kernel-height 2))
             (border-type :replicate) ; Border-type may be :replicate or :wrap-around
             (padded-image (make-bordered-image
                             image
                             (list kernel-half-width (- kernel-width kernel-half-width))
                             (list kernel-half-height (- kernel-height kernel-half-height))
                             border-type border-type))
             (into-image (make-image (list x-dim y-dim)
                                     :element-type (image-element-type image)))
             (result-fixp (image-fixp into-image)))
        (with-image-elements (padded-image (into-image :write))
          (loop for into-y from 0 below y-dim do
            (loop for into-x from 0 below x-dim
                  for val = (loop for ky from 0 below kernel-height
                                  for in-y from into-y
                                  sum (loop for kx from 0 below kernel-width
                                            for in-x from into-x
                                            sum (* (aref kernel kx ky)
                                                   (iref padded-image in-x in-y))))
                  do (setf (iref into-image into-x into-y)
                           (if result-fixp
                               (floor val)
                               val)))))
        into-image))

Figure 3: Convolution in Image Calc

    ;;; Convolution in Powervision
    (defiu convolution (input kernel)
      menu utility
      input (image)
      output (smoothed)
      with (wr (array-dimension-n 1 kernel))
           (wc (array-dimension-n 2 kernel))
           (cr (// wr 2))
           (cc (// wc 2))
      in wr by wc
      boundary (img reflect)
      centerr cr centerc cc
      do (pset (loop for lr from (- cr) to cr
                     summing (loop for lc from (- cc) to cc
                                   summing (* (aref kernel (+ lr cr) (+ lc cc))
                                              (pget-r img lr lc))))
               smoothed))

Figure 4: Convolution in Powervision

Figure 4 shows the same algorithm programmed in
Powervision. In this example, defiu corresponds to the
expansion environment. It generates standard keyword
arguments, error checking and recovery code, database
management code, hooks for interactive function application,
boundary checking, and efficient loop generation.
It  contains  knowledge  about  providing  defaults  and  effi¬ 
ciently  implementing  common  image  processing  idioms. 
The  menu  keyword  indicates  that  this  function  should 
be  put  into  the  utility  menu  so  that  it  can  be  interactively 
applied  with  the  mouse.  The  input  and  output  keywords 
define the objects that are linked together with the processing
history in a database as described in Section 4. By
default,  the  output  object  is  a  copy  of  the  input  object 
without  the  same  values.  The  with  keyword  defines  some 
local bindings. The in, by, centerr and centerc keywords
indicate  the  size  and  center  of  the  rectangular  neighbor¬ 
hood  being  used.  The  boundary  keyword  indicates  the 
type  of  boundary  checking  to  be  performed.  This  bound¬ 
ary  checking  code  is  generated  in-line,  so  that  the  same 
function  cannot  be  called  on  objects  that  require  different 
kinds of boundary checking. The looping over each location
in the input image is automatically generated around
the code marked by do. The pget-rs are access functions
that get pixels relative to the current window's center.
Each of these pgets automatically expands into code
that  reflects  pixel  accesses  outside  of  the  image  back  into 
the  image. 
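
The loop that Powervision generates around the do clause can be pictured with the following minimal sketch. It is written in Python purely for illustration, not in the Lisp of Figure 4, and the helper names reflect and convolve are hypothetical. It assumes a kernel with odd side lengths and assumes that reflection mirrors an out-of-range index across the image edge; the text does not say which reflection rule Powervision uses.

```python
def reflect(index, size):
    """Mirror an out-of-range index back into [0, size) (assumed rule)."""
    while index < 0 or index >= size:
        index = -index - 1 if index < 0 else 2 * size - index - 1
    return index


def convolve(image, kernel):
    """Sum kernel-weighted neighbours around every pixel (odd kernel sides)."""
    rows, cols = len(image), len(image[0])
    cr, cc = len(kernel) // 2, len(kernel[0]) // 2   # window centre, as in Figure 4
    smoothed = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):            # the loop the macro generates implicitly
        for c in range(cols):
            total = 0.0
            for lr in range(-cr, cr + 1):
                for lc in range(-cc, cc + 1):
                    # plays the role of pget-r: a pixel relative to the centre
                    pixel = image[reflect(r + lr, rows)][reflect(c + lc, cols)]
                    total += kernel[lr + cr][lc + cc] * pixel
            smoothed[r][c] = total
    return smoothed


image = [[1, 2], [3, 4]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
unchanged = convolve(image, identity)
summed = convolve(image, box)
```

Like the figure, the sketch does not flip the kernel, so it computes a correlation; for symmetric smoothing kernels the two coincide.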

Figure  5  shows  the  same  algorithm  programmed  in 
View.  This  is  a  very  general  version  of  convolution  in  that 
both  the  input  object  and  the  kernel  can  be  arbitrarily 
shaped, and the same routine can be called with objects
that have very different representations. This generality is
a result of using an object-oriented design that associates
operations  such  as  access  and  iteration  with  objects.  To 


;;; Convolution in View
(defmethod convolution ((input geometric)
                        (kernel geometric))
  (let ((output (copy input :data nil :element-type 'float)))
    (object-access ((input :boundary 'boundary-reflect) output)
      (map-location-values
        #'(lambda (r c &rest pixels)
            (let* ((sum 0))
              (walk-values #'(lambda (value)
                               (incf sum (* value (pop pixels))))
                           kernel)
              (setf (vget output r c) sum)))
        input
        :value-accessor (neighborhood-accessor
                          (accessor input 'vget) kernel)))))

(link-to-edb 'convolution)

Figure  5:  Convolution  in  View 

restrict  the  processing  of  an  algorithm  to  some  sub-region 
of  a  geometric  object,  a  new  object  is  created  that  com¬ 
bines  a  geometric  object  for  the  sub-region  and  the  original 
geometric  object.  This  is  efficient  since  the  pixels  outside 
of  the  sub-region  never  need  to  be  looked  at. 

The  defmethod  in  this  example  indicates  that  this 
definition  of  convolution  is  a  generic  function  associated 
with  input  and  kernel  objects  that  inherit  from  the  geo¬ 
metric  class.  Object-access  is  an  example  of  an  expan¬ 
sion  environment.  It  allows  a  programmer  to  give  type 
and  boundary  checking  information  and  to  indicate  that 
object  operations  that  are  inside  the  body  of  the  form  are 
going  to  be  happening  many  times.  As  a  result,  the  oper¬ 
ations  are  optimized  by  doing  the  setup  for  the  operation 
once  when  object-access  is  entered  rather  than  every  time 
the  operation  is  done.  Inside  an  object-access,  some  op¬ 
erations  like  vget  expand  into  more  efficient  code  than 
they  would  outside  of  the  object-access.  Vget  is  an  ac¬ 
cessor  that  given  an  object  and  a  set  of  domain  indexes 
returns the corresponding value in the range. The boundary
checking for a vget is specified once in an object-access
form  rather  than  every  time  a  vget  is  used.  Here  refer¬ 
ences  outside  of  input  are  defined  to  reflect  back  into  the 
object.  Map-location-values  is  an  iteration  construct 
that calls the function marked by #' on the result of the
location  accessor  and  the  neighborhood  accessor  for  each 
domain  index  in  the  object.  The  neighborhood  accessor  is 
constructed  by  using  the  kernel  to  define  the  offsets  from 
a  base  domain  index  and  calling  the  accessor  on  each  of 
those  domain  indexes.  Iteration  constructs  that  start  with 
map  indicate  that  the  operations  are  order  independent, 
so  that  they  can  be  implemented  in  parallel.  Those  that 
start  with  walk  are  called  in  a  specific  and  controllable  or¬ 
der.  The  second  iteration  construct  walk-values  is  used 
to  walk  over  the  values  associated  with  each  location  in 
the kernel and to sum the product of that value and the
corresponding  value  from  the  input  object.  The  function 
link-to-edb indicates that this function should be linked
with the Environmental DataBase (EDB) as described in
Section 4. View separates this linkage from the definition
of the operation so that a particular application can determine
whether entries should be made in the EDB or not.
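
The way View builds convolution from a neighborhood accessor and two iteration constructs can be sketched as follows. This is an illustrative Python analogue, not View code. Objects are assumed to be dictionaries from (row, column) locations to values so that both input and kernel may be arbitrarily shaped, and references outside the input return a constant rather than being reflected as in Figure 5. All function names are hypothetical stand-ins for the constructs named in the text.

```python
def neighborhood_accessor(accessor, kernel):
    """Build an accessor returning the values at every kernel offset."""
    offsets = list(kernel)                 # fixed walk order over the kernel
    def get_neighbors(r, c):
        return [accessor(r + dr, c + dc) for dr, dc in offsets]
    return get_neighbors


def map_location_values(fn, obj, value_accessor):
    """Call fn once per location; no ordering between calls is relied on."""
    for (r, c) in obj:
        fn(r, c, value_accessor(r, c))


def convolution(input_obj, kernel, outside=0.0):
    output = {}

    def vget(r, c):
        # Assumption: constant value outside the object instead of reflection.
        return input_obj.get((r, c), outside)

    def at_location(r, c, pixels):
        # Walk the kernel values in the same order as the offsets above.
        output[(r, c)] = sum(w * p for w, p in zip(kernel.values(), pixels))

    map_location_values(at_location, input_obj,
                        neighborhood_accessor(vget, kernel))
    return output


l_shape = {(0, 0): 1, (0, 1): 2, (1, 0): 3}     # not a rectangle
right_sum = {(0, 0): 1, (0, 1): 1}              # pixel plus its right neighbour
result = convolution(l_shape, right_sum)
```

Because only the locations present in the input are visited, a sub-region object restricts the work in the same way the text describes.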

4.  SYSTEM  SUPPORTED  DATABASES

There  are  several  databases  supported  by  IU  Envi¬ 
ronments.  These  may  be  referred  to  as  the  Programmers 
DataBase,  the  Environmental  DataBase,  and  the  Long 
Term  DataBase  although  they  appear  in  several  differ¬ 
ent  forms  in  different  environments.  The  Programmers 
DataBase  (PDB)  contains  descriptions  of  processing  rou¬ 
tines  and  representations  provided  by  the  basic  system,  by 
library  modules  or  developed  by  other  users.  The  Envi¬ 
ronmental  DataBase  (EDB)  contains  objects  that  are 
produced  during  interactive  or  autonomous  processing  and 
the  relationships  between  them.  The  EDB  is  an  umbrella 
for  two  other  databases,  the  Object  DataBase  (ODB) 
which  contains  the  objects  generated  by  an  application, 
and  the  Processing  Relations  DataBase  (PRDB)  that 
records  the  functional  relationships  between  the  objects 
stored in the ODB. Figure 6 shows the EDB records generated
by a call first to vgradient, and then to
gradient-non-maxima-suppression.
Gradient-non-maxima-suppression-vector, called by
gradient-non-maxima-suppression, actually generated the
returned result. There is usually a distinction between
objects in the immediate or interactive
environment  and  those  which  are  stored  permanently  for 
later  recall  and  use.  The  Long  Term  DataBase  (LTDB) 
stores permanent entries from the EDB. We briefly describe
some important features of these different system
databases.

Figure  6:  EDB  example


4.1.  OBJECT  DATABASE

The Object Database (ODB) contains all of the globally
accessible structures generated by an application, e.g.
images, edges, regions, histograms, match results, hypothesized
relations between objects and databases. With this
database and interactive tools for exploring it, such as
browsers, a user or process can refer to objects by their
characteristics as well as their names. In this way, it provides
a means of communication between different modules
which can refer to the characteristics of the objects that
they are interested in independently of the module that
generated the object.

The  ODB  is  supported  by  constructs  for  grouping  re¬ 
lated  objects  into  a  local  database.  In  this  way,  the  ODB 
can  be  organized  hierarchically,  which  is  necessary  because 
of  its  potentially  large  size.  For  example,  at  the  top  level 
might be found images, local databases for edges and regions
extracted from the same image, and significant perceptual
groups created by combining primitive structures
found  in  an  image.  Attributes  typically  associated  with 
ODB  entries  are  the  time  of  creation,  the  object  stored, 
parent  entries,  children  entries,  and  a  name  or  number  for 
reference.  The  parents  slot  and  the  children  slot  are  used 
to make connections to the entries in the PRDB. Through
these  connections,  the  processing  history  of  an  object  is 
available. 

4.2.  PROCESSING  RELATIONS  DATABASE 

The Processing Relations Database (PRDB) records
function  calls,  their  parameters  and  returned  results. 
PRDB  instances  typically  consist  of  a  time,  the  function 
name,  the  parameters  supplied  to  the  function,  the  ODB 
objects used as parameters, the ODB objects returned by
the  function,  notes  added  by  the  function  writer,  notes 
added  by  the  function  caller,  and  information  regarding 
the  context  in  which  the  function  was  applied.  Instances 
in  the  PRDB  can  also  correspond  to  running  processes  and 
provide  a  way  to  monitor  their  status. 
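
The link between ODB entries and PRDB instances can be sketched as two record types that point at each other. The sketch below is a hypothetical Python illustration, not the actual record layout of any IU Environment. It keeps only a few of the slots listed above (times, notes and context are omitted), and the two toy processing functions merely borrow the names used in the EDB example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False, repr=False)
class PRDBEntry:
    function: str                                  # name of the function called
    params: dict                                   # non-object parameters
    inputs: list                                   # ODB entries used as parameters
    outputs: list = field(default_factory=list)    # ODB entries returned


@dataclass(eq=False, repr=False)
class ODBEntry:
    name: str
    obj: object
    parents: list = field(default_factory=list)    # calls that produced the object
    children: list = field(default_factory=list)   # calls that consumed the object


def record_call(fn, inputs, **params):
    """Apply fn to the stored objects and record the call in both directions."""
    result = fn(*[entry.obj for entry in inputs], **params)
    call = PRDBEntry(fn.__name__, params, list(inputs))
    produced = ODBEntry(fn.__name__ + "-result", result, parents=[call])
    call.outputs.append(produced)
    for entry in inputs:
        entry.children.append(call)
    return produced


def history(entry):
    """Function names that led to an entry, earliest first."""
    names = []
    for call in entry.parents:
        for source in call.inputs:
            names.extend(history(source))
        names.append(call.function)
    return names


def vgradient(image):                              # toy stand-in
    return [b - a for a, b in zip(image, image[1:])]


def gradient_non_maxima_suppression(gradient, floor=0):   # toy stand-in
    return [g if g > floor else 0 for g in gradient]


image = ODBEntry("image", [1, 4, 9, 10])
grad = record_call(vgradient, [image])
thin = record_call(gradient_non_maxima_suppression, [grad], floor=2)
```

Following the parents slot from thin back to image recovers the processing history described in Section 4.1.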

There  are  often  storage  management  operations  as¬ 
sociated  with  entries  in  the  ODB  and  PRDB.  A  user  or 
application should have control of which function calls are
stored in the PRDB, so that only function calls that generate
important results are present. An object can be faded
so that the storage associated with the object can be reclaimed.
When an object is faded, its ODB entry and
its  parent  PRDB  entries  remain.  A  faded  object  can  still 
be  regenerated  from  its  processing  history  if  it  is  needed 
again.  A  fade  operation  trades  storage  space  for  process¬ 
ing  time.  When  an  object  is  saved,  its  ODB  entry  together 
with its processing history is saved with all of the ODB entries
faded. An object can also be forgotten, which is like
fading  except  that  as  soon  as  all  of  the  result  ODB  entries 
associated  with  a  PRDB  entry  are  faded,  the  ODB  entries 
and  the  PRDB  entry  are  permanently  removed. 
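
Fading can be illustrated with a small sketch in which an entry keeps its producing function and inputs but drops its value. This is a hypothetical Python illustration under the assumption that every recorded function is deterministic; the names Entry, run, fade and get are not taken from any of the systems described, and the forget operation is not modelled.

```python
class Entry:
    """An ODB entry plus the parent call that produced it (None for raw data)."""

    def __init__(self, name, value, producer=None, inputs=()):
        self.name = name
        self.value = value
        self.producer = producer       # function recorded in the parent PRDB entry
        self.inputs = tuple(inputs)    # ODB entries passed to that function
        self.faded = False


def get(entry):
    """Return the stored value, regenerating it from its history if faded."""
    if entry.faded:
        entry.value = entry.producer(*[get(e) for e in entry.inputs])
        entry.faded = False
    return entry.value


def run(name, fn, *inputs):
    """Apply fn to stored objects and keep the call as processing history."""
    return Entry(name, fn(*[get(e) for e in inputs]), fn, inputs)


def fade(entry):
    """Reclaim the storage of a derived object but keep its entry and history."""
    if entry.producer is None:
        raise ValueError("raw data has no history to regenerate from")
    entry.value = None
    entry.faded = True


def differences(xs):
    return [b - a for a, b in zip(xs, xs[1:])]


image = Entry("image", [1, 4, 9])
grad = run("gradient", differences, image)
total = run("total", sum, grad)

fade(grad)
fade(total)
stored_after_fade = (grad.value, total.value)
regenerated = get(total)               # recomputes grad first, then total
```

The regeneration walks the history recursively, which is the trade of storage space for processing time that the text describes.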

An important use of the PRDB is that function results
can be cached. If a function is called with exactly
the same parameters as before, the same results can be returned
without calling the function again. Caching makes
it easier to reuse high level building blocks that might have
several operations in common at a lower level. For example,
a function that goes from an intensity image to edges,
and another function that goes from an intensity image to
regions might both compute the gradient of the image. If
the gradient operation is stored in the PRDB, then a quick
check of the PRDB will show that the same operation was
called before, and the old result can be returned.
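
The caching behaviour can be sketched as a lookup keyed by function name and parameters. The following Python sketch is illustrative only: the PRDB class, its call method and the toy edges and regions functions are hypothetical, and parameters are assumed to be hashable (images are tuples here).

```python
class PRDB:
    """Records calls so that a repeated call returns the stored result."""

    def __init__(self):
        self.records = {}       # (function name, parameters) -> result
        self.calls_made = 0     # how many times a function was really run

    def call(self, fn, *args):
        key = (fn.__name__, args)
        if key not in self.records:
            self.calls_made += 1
            self.records[key] = fn(*args)
        return self.records[key]


def gradient(image):
    return tuple(b - a for a, b in zip(image, image[1:]))


def edges(prdb, image, threshold=2):
    """Positions where the gradient magnitude reaches the threshold."""
    g = prdb.call(gradient, image)
    return tuple(i for i, d in enumerate(g) if abs(d) >= threshold)


def regions(prdb, image, threshold=2):
    """Number of runs separated by strong gradients."""
    g = prdb.call(gradient, image)
    return 1 + sum(1 for d in g if abs(d) >= threshold)


prdb = PRDB()
image = (1, 1, 5, 5, 6)
edge_positions = edges(prdb, image)
region_count = regions(prdb, image)     # reuses the stored gradient
```

Both high level functions request the gradient, but it is computed only once.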

4.3.  LONG  TERM  DATABASE

The  Long  Term  Database  (LTDB)  contains  descrip¬ 
tions  of  all  permanent  ODB  objects.  The  LTDB  helps 
when  developing  an  application  to  manage  the  large  num¬ 
ber  of  objects  that  are  stored.  Having  a  large  number 
of  images  and  preprocessed  results  available  saves  time  in 
testing  image  understanding  routines  and  makes  it  possi¬ 
ble  to  test  a  routine  on  a  variety  of  data  sets.  Each  entry 
has  a  name,  type,  creation-date,  source,  documentation, 
directory, files, keywords and history slot. An important
attribute  of  the  LTDB  is  that  it  can  store  database  in¬ 
stances  as  objects.  Thus  EDB  instances  can  be  saved  as 
objects  to  retain  a  complete  processing  history  so  process¬ 
ing  environments  can  be  saved  between  sessions. 

4.4.  PROGRAMMERS  DATABASE

The  Programmers  Database  (PDB)  automates  the 
management  of  function  and  object  definitions.  It  enables 
a user to find code generated by others to avoid duplicating
work. This is particularly important where the environment
is being used by many people working on different
applications  who  are  not  directly  communicating  with  each 
other. Automatic maintenance of the PDB is important
because  users  might  not  spend  the  effort  needed  to  main¬ 
tain  it  manually.  Entries  in  the  PDB  are  automatically 
updated  each  time  a  file  is  compiled.  Entries  in  the  PDB 
have slots for the type, name, parameters, documentation,
file, author, and date of last modification.

5.  USER  INTERFACE 

The  user  interface  is  the  contact  point  between  a  user 
and the environment. Current interfaces to IU Environments
include underlying software development tools such
as editors, window-based display systems, and browsers
for searching large textual and pictorial databases. The
interface  provides  ways  to  view  objects  as  text,  networks, 
graphs,  images,  curves,  regions,  surfaces  and  volumes  in 
general and natural ways. It is important that an interface
be highly interactive because it is used constantly. A
bad  one  becomes  a  burden  very  quickly.  The  quality  of  an 
interface  can  also  change  as  a  user  changes.  Helpful  menu- 
based  systems  become  tedious  and  wasteful  as  novices  be¬ 
come  adept.  It  is  useful  if  the  interface  maintains  a  user 
model  so  each  user  can  specialize  their  interface  and  indi¬ 
vidualize  it  through  default  settings. 

5.1.  WINDOWS 

Window-based [Tei77] interfaces are powerful, simple
to use, and now very common. Most systems are written
using machine-specific window packages that provide base
abilities for defining information spaces and manipulating
windows with respect to them. The particular displays in
image understanding environments are then built on top
of this basic substrate. Software standards for machine-portable
window systems such as X [SG86] and NeWS
[Sun87] are beginning to become available and will lead

IU and Graphic Environments extend window systems
in  several  ways.  One  of  these  is  to  use  windows  to  distin¬ 
guish  between  the  attributes  of  an  object  and  the  param¬ 
eters  describing  how  an  object  is  displayed.  This  allows 
operations  such  as  panning  or  zooming  to  be  associated 
with  a  window  and  then  inherited  by  any  object  displayed 
in that window. This provides a default display context for
subsequent object displays in a window. IU environments
often  associate  a  local  database  with  windows  to  main¬ 
tain  the  history  of  displays  at  the  window.  This  is  useful 
for  accessing  a  previous  display  without  going  through  the 
original  procedures  that  generated  it  and  for  displaying 
time-varying  information  such  as  image  sequences.  Win¬ 
dows  can  also  be  linked  so  that  changing  the  information 
in  one  window  causes  the  information  in  the  linked  window 
to  change.  This  is  especially  useful  for  multi-resolution 
views  of  objects.  Windows  with  different  display  types 
can also be linked. A typical example is a text browser
linked  to  an  object  display  window.  When  different  ob¬ 
jects  are  found  using  the  browser,  they  are  automatically 
displayed  in  the  object  display  window. 
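
Linked windows, display parameters held by the window, and a per-window display history can be sketched together. This Python sketch is a hypothetical illustration: the Window class, its methods and the 64-pixel window size are assumptions made for the example, not part of any system described here.

```python
class Window:
    """Holds display parameters (zoom, centre) separately from any object."""

    def __init__(self, name, zoom=1):
        self.name = name
        self.zoom = zoom            # inherited by whatever is displayed here
        self.center = (0, 0)
        self.links = []             # windows that follow this one
        self.history = []           # local database of previous displays

    def link(self, other):
        self.links.append(other)

    def show(self, center, _seen=None):
        """Recentre this window and propagate the change to linked windows."""
        seen = _seen if _seen is not None else set()
        if id(self) in seen:        # guard against cycles of mutual links
            return
        seen.add(id(self))
        self.center = center
        self.history.append(center)
        for window in self.links:
            window.show(center, seen)

    def visible_extent(self, size=64):
        """Image rectangle covered by a size-by-size window at this zoom."""
        half = size // (2 * self.zoom)
        r, c = self.center
        return (r - half, c - half, r + half, c + half)


overview = Window("overview", zoom=1)
detail = Window("detail", zoom=4)
overview.link(detail)
detail.link(overview)               # links in both directions
overview.show((84, 24))
```

The two windows end up centred on the same location while showing it at different resolutions, which is the multi-resolution use mentioned above.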

5.2.  DISPLAY  TYPES 

Environments  typically  support  several  predefined 
types  of  displays  for  spatial  objects  such  as  images,  curves, 
regions,  vectors,  and  surfaces  in  addition  to  objects  such  as 
graphs, networks and text. The display of spatial objects
requires efficiency with respect to the underlying representation
used for the object. Some representations are
expressed as discrete sets of connected points while others
are represented by analytic expressions. This can
be  done  by  expressing  display  algorithms  in  terms  of  the 
basic  iteration  functions  described  in  Section  3  that  are 
associated  with  each  underlying  representation.  Because 
all objects can be accessed the same way, a small set of
display functions can be used on a wide variety of objects.
Figure 7 shows a simple definition of the generic function
display-object for displaying geometric objects that lie
in a two dimensional space in a two dimensional spatial
display window. The display window takes care of the
issues of zooming, coordinate transformations and color.
The generic function only has to tell the window the location
of the pixels in image coordinates. Map-locations
calls the function marked by #' with the location of each
position along the two dimensional object, potentially in
parallel. This definition will work with any point, curve
or region that lies in a two dimensional space, whatever
its internal representation. When called on a specific object,
it will run efficiently because the generic function
map-locations is defined for that object's representation.

It  is  important  that  display  operations  can  be  used 
flexibly  to  highlight  the  characteristics  of  a  given  object. 
Simple examples of this are mapping pixel values onto the
range of available display intensities for optimal contrast
when displaying an image or color coding objects by the
values of their attributes. Instead of defining several specialized
displays, this can be done using virtual objects.

(defmethod display-object ((object 2d-geometric)
                           (window 2d-spatial-window))
  (map-locations #'(lambda (r c)
                     (plot-pixel window r c))
                 object))

Figure  7:  Object  display  method 

A  virtual  object  is  a  functional  closure  [SJ84]  which  can 
be  applied  to  an  object  before  it  is  displayed.  The  func¬ 
tional  closure  can  be  reused  and  combined  with  other  func¬ 
tional  closures  to  produce  very  complicated  transforma¬ 
tions.  The  same  effect  could  be  achieved  with  functions 
that  generated  successive  objects  with  transformed  values, 
but  it  would  require  creating  potentially  large  intermediate 
objects  for  each  display. 

An  example  of  the  use  of  virtual  objects  for  display 
in  Powervision  is  shown  in  Figure  8.  The  middle  win¬ 
dow  shows  a  contour  map  of  an  elevation  image  with  the 
mouse  located  at  84,  24  on  the  terrain  grid.  The  leftmost 
window shows a perspective surface display from that position.
The rightmost window shows the same perspective
view,  but  this  time  the  object  being  displayed  is  a  virtual 
object  that  flattens  any  elevation  pixels  that  are  near  the 
original  position.  A  function  to  do  this  flattening  is  shown 
in  Figure  9.  As  each  pixel  in  the  terrain  grid  is  accessed,  it 
is  tested  to  see  whether  it  is  near  the  original  position  and 
if it is, then a constant value is returned. The advantages
of  using  a  virtual  object  are  that  no  large  image  needs  to 
be  allocated,  and  that  the  function  that  defines  the  vir¬ 
tual  object  is  only  called  on  the  pixels  that  are  actually 
accessed. 
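
The flattening virtual object can be sketched as a closure applied at access time. The sketch below is a Python illustration rather than Powervision code. The position 84, 24 comes from the text and the window of 10 pixels and the constant 276 appear in the surviving fragment of Figure 9, but the function and class names, and the way closures are combined, are assumptions.

```python
def flatten_near(row0, col0, radius=10, level=276):
    """Closure returning a constant for pixels near (row0, col0)."""
    def virtual(pixel, row, col):
        if abs(row0 - row) < radius and abs(col0 - col) < radius:
            return level
        return pixel
    return virtual


def compose(*transforms):
    """Combine closures; each sees the value produced by the previous one."""
    def virtual(pixel, row, col):
        for transform in transforms:
            pixel = transform(pixel, row, col)
        return pixel
    return virtual


class VirtualImage:
    """Applies a closure on access; no transformed copy is ever allocated."""

    def __init__(self, image, transform):
        self.image = image
        self.transform = transform
        self.accesses = 0

    def get(self, row, col):
        self.accesses += 1
        return self.transform(self.image[row][col], row, col)


terrain = [[r + c for c in range(100)] for r in range(100)]
view = VirtualImage(terrain, flatten_near(84, 24))
at_mouse = view.get(84, 24)       # inside the flattened window
far_away = view.get(0, 0)         # untouched elevation
on_border = view.get(84, 34)      # exactly 10 columns away, not flattened

doubled = VirtualImage(terrain, compose(flatten_near(84, 24),
                                        lambda p, r, c: 2 * p))
at_mouse_doubled = doubled.get(84, 24)
```

Only the three pixels requested from view were ever transformed, which is the saving the text attributes to virtual objects.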


Figure  8:  Virtual  Object  Display
          (< (abs (- 24 *col*)) 10))
    276
    pixel))

Figure  9:  Surface  Display  of  a  Virtual  Object

It is useful for graphics to be associated with an environment.
These can range from a routine library for drawing
points, lines, and simple geometric figures to complete


rendering systems for perspective displays of surfaces. An
important addition for graphics commands in an IU Environment
is that the routines have the capability of directly
generating objects from the IU environment, such
as  images,  curves  and  junctions  in  addition  to  a  display 
image.  This  is  useful  for  interactive  segmentation  and  the 
generation  of  idealized  data  for  testing  algorithms. 

5.3.  BROWSERS 

Browsers  are  an  interactive  query  mechanism  for 
finding and exploring relationships between objects in
databases. Among the different types of browsers that
are used are text-oriented browsers, image browsers and
graph  browsers.  All  of  these  tend  to  maximize  the  amount 
of  information  breadth  or  depth  presented  at  one  time  and 
provide  a  means  to  apply  functions  interactively  to  selected 
objects.  Browsers  can  be  sensitive  to  the  type  of  database 
being  queried  to  effectively  map  specialized  query  and  link- 
following  operations  onto  keys. 

The text-oriented browser usually appears as a spreadsheet-like
presentation of text which describes objects or
functions. The text is sensitive to a pointing device, such
as a mouse, and selected text is highlighted. The display
of entities is controllable in terms of the database of
objects being browsed and which attributes are displayed.
There  are  two  general  display  modes  associated  with  the 
text-oriented browser. The “object/line” mode presents
each  object  on  a  single  line  of  text  divided  into  fields  and 
maximizes  the  breadth  of  information  that  can  be  seen  at 
one time. The “field/line” mode displays one object at a
time  with  one  field  per  line,  maximizing  the  depth  of  infor¬ 
mation  displayed  about  that  one  object.  The  object/line 
display  is  useful  when  looking  for  a  specific  set  of  objects. 
As  the  set  of  objects  is  narrowed  down,  the  set  of  objects 
that meet the selected specifications is shown. The field/line
display  is  useful  when  stepping  through  a  subset  of  the  ob¬ 
jects  in  a  database,  looking  at  each  object  in  detail.  When 
interacting with a database, not all of the elements can be
shown in the text display at once, so the basic operations
are to select and order the set of objects.

Image browsers use reduced resolution pictures of objects
which are sensitive to a pointing device. Usually a
text browser or a filter is used to narrow down the number
of images to browse. The image browser can be used
to scroll through the pictures, look at their entries in the
LTDB and load the objects. The pictures are generated
automatically on a request to browse a specific object and
then stored so that if that object is browsed again, the
browser can use the previously generated picture.

Graph browsers display objects and the relationships
between them as trees or semantic nets. In the same
way that the image browser shows what a stored object
looks like more clearly than its textual description, the
graph browser shows the connections between objects more
clearly. It can be used to look at the relationships between
objects in the ODB and PRDB, or to trace the processing
flow through object models.

Figure  10:  Display  Values


5.4.  INTERACTIVE  DEVICES 

An important attribute of any interface is the extent
to which it maps context-sensitive functionality onto input
devices such as terminal keys, joysticks, trackballs, and
mouse keys. For example, a common operation in an IU
environment is to find something that was done, apply a
function to it, and then display it. Using interactive aids,
this can involve establishing a text-based browser linked
to the Object Database with a few clicks and mouse motions.
A relevant image produced sometime during a long
interactive session can be found with a few more mouse
actions over the window. The selected image can then be
displayed or have some routine applied to it.

Building interactive interfaces remains an art, though
certain basic principles are clear and intuitively appealing
[Moi79]. The interface should have explicit access to
the context in which an action is occurring so it can interpret
incompletely specified user actions. This generally
implies a history-state mechanism. Another is that
the common spatial and behavioral intuitions that people
have are maintained with respect to the interactive devices.
Left and right, up and down should be consistently
maintained in the object, the display and the corresponding
device. When mapping object parameters onto generic
control keys, if the parameters have a spatial meaning,
this should be reflected in the position of keys. Excellent
examples of this can be found in the surface-perspective
interface and mouse key definitions used in Image Calc.

A basic interactive operation supported by many environments
is a general point-and-apply function for indicating
some object or object-token and applying a function
to it. The applied function can be a display action or a
more complicated recursive use of the point-and-apply function
itself. Figure 10 shows an example
of such a point-and-apply function called display-values.


In this example, the display-values function is shown in
the lower left window. Display-values is being applied to
the *labels* image which consists, not of numbers, but of
pointers to all the objects which occupy a given location in
that image. This is possible because images are described
as abstract types which can combine with others freely.
Window *d5* shows some curves extracted from the ever-present
mandrill image. When the mouse was clicked in
*d5*, the function defined in the middle window was executed.
This function highlights the curve selected in *d5*, finds curves
that are within 16 pixels of the selected curve and then
displays the curves in *d6*. A textual description of some
of the curves is shown in the rightmost Lisp window.

6.  FUTURE  DIRECTIONS

There are several ongoing trends in hardware, software
and vision research affecting the future development
of image understanding environments. All these factors
will make IU Environments less expensive, more powerful
and more common. The main driving force is the continued
advance of hardware: every year, machines have more
storage and are faster, cheaper and easier to connect together.
Optical devices are making the storage and transmission
of large amounts of digital imagery feasible. Two
complementary trends in vision hardware development are
worth noting. There are currently several advanced, experimental
architectures being built specifically for machine
vision, e.g. the Warp, the IUA and the
Connection Machine. Software interfaces to these
machines are being built and will eventually incorporate
IU environments. The other hardware trend is the development
of inexpensive and modular pieces of vision hardware
for digitization and
pixel-level operations. In the dawning era of the personal
supercomputer workstation, we will see integrated turnkey
vision systems built out of these components for both industry
and for mass market video workstations. We are
starting to see a minor growth industry in migrating AI
and  vision  software  to  SUNS  and  MAC  IIs. 

There  also  will  continue  to  be  advances  in  general  soft¬ 
ware  technology  related  to  the  development  of  Computer 
Assisted  Software  Engineering  (CASE)  tools  that  auto¬ 
mate  several  tasks  for  developing  and  maintaining  pro¬ 
grams.  CASE  tools  can  keep  track  of  code  and  comments 
through  databases  and  browsers  and  allow  the  automatic 
generation  of  some  of  the  code  from  a  specification  of  what 
needs  to  be  done.  Advances  in  compilers  allow  the  creation 
of new programming languages like Ada and object oriented
programming systems [Mey87]. User interfaces will
incorporate  these  general  developments  in  supporting  pro¬ 
gramming  in  workstation  environments  with  several  new 
types of interactive devices [Sci87]. User databases will
incorporate advances in hypertext [Con87] technology for
increased  flexibility. 

Currently  many  environments  are  being  used  as  the 
basis  of  intelligent  interactive  systems  for  image  analysis 
tasks  in  aerial  photo  and  medical  imaging  interpretation. 
In  these,  much  of  the  high-level  hypothesis  management 
and inference associated with an autonomous vision system
is done by a human who can interactively match a
model  to  an  image  with  a  small  number  of  interactive  op¬ 
erations.  The  user  interface  provides  for  the  automatic 
storage  and  cross  referencing  of  interpretations.  Examples 
are extensions to Image Calc for terrain interpretation,
Powervision for biomedical applications, and the KBVision
system for various applications. IU environments will
be  integrated  with  CAD/CAM  systems  to  provide  tools 
for  developing  workpiece  inspection  routines  as  part  of  the 
manufacturing  design  process. 

One  of  the  major  difficulties  with  the  library  approach 
to IU software, in addition to problems with extendibility,
was that the libraries themselves were too limited. Though
vision is too complex a problem to have a uniform solution,
there is a much larger body of techniques than was
available  before.  This  will  be  reflected  in  modular  vision 
software  for  complex  operations.  Certain  portions  of  vision 
research  can  be  packaged  or  standardized  now:  constrained 
vision  systems,  stereo,  iconic  to  symbolic  mapping,  surface 
reconstruction  from  sparse  samples,  and  camera  modeling 
software. 

ABSTRACT

Matching of data structures such as digital images or
labeled  graphs  is  computationally  expensive,  because  it 
requires  node-by-node  comparisons  of  the  labels.  If  we 
have probabilistic models for the classes of data structures
being matched, we can reduce the expected computational
cost of matching by comparing the nodes in an
appropriate  order.  This  paper  proves  some  general 
results  about  this  approach,  and  also  presents  experimen¬ 
tal  results  for  digital  images,  when  we  know  the  probabil¬ 
ity  densities  of  their  gray  levels,  or  more  generally,  the 
probability  densities  of  arrays  of  local  property  values 
derived  from  the  images. 

1.  INTRODUCTION 

Finding  matches  between  data  structures  is  a  basic 
operation  in  many  branches  of  computer  science.  In  this 
paper  we  deal  with  the  important  class  of  situations  in 
which  the  nodes  of  the  data  structures  have  labels  that 
are  elements  of  a  metric  space — for  example,  real 
numbers,  or  vectors  having  real  components.  In  such 
situations  we  can  define  matches  quantitatively,  in  terms 
of  the  distances  between  the  labels  of  corresponding 
nodes.  This  type  of  quantitative  matching  is  widely  used 
in  fields  such  as  image  processing,  computer  vision,  and 
pattern  recognition,  where  the  data  structures  might  be 
images  (arrays  of  numerical-  or  vector-valued  pixels)  or 
numerically  labeled  graphs  (where  the  nodes  might,  e.g., 
represent  segments  of  an  image  and  their  labels  might  be 
tuples  of  property  values  computed  for  these  segments). 
We  will  formally  define  the  quantitative  matching  prob¬ 
lem  in  Section  2. 

Finding “close” matches between two data structures
is an expensive process, because one needs to consider
many possible correspondences between the nodes of
the two structures, and for each correspondence, to compute
the distances between the labels of corresponding
pairs of nodes, node by node. Some work has been done
on reducing the amount of computation involved in this
process by taking advantage of the overlaps between
correspondences [1]; but we will not pursue this approach
here. Rather, we will consider the possibility of reducing
the expected amount of computation for a given
correspondence, by comparing the labels of corresponding
pairs of nodes in an appropriately chosen order. We will
show how we can define such an order when we are given
probabilistic models for the classes of data structures
being matched.

In  Section  3  we  prove  some  general  results  about 
how  to  order  the  comparisons  so  as  to  minimize  the 
expected  computational  cost.  In  Sections  4  and  5  we  give 
some  experimental  results  for  the  case  where  the  data 
structures  being  matched  are  digital  images. 

2.  QUANTITATIVE  MATCHING 

We  will  assume  from  now  on  that  the  data  struc¬ 
tures  being  matched  are  labeled  graphs;  note  that  digital 
images  can  be  regarded  as  the  special  case  in  which  the 
pixels  are  the  nodes  of  the  graph,  the  arcs  connect  pixels 
to  their  neighbors,  and  the  labels  are  the  pixel  values.  As 
already  indicated,  we  assume  that  the  labels  are  elements 
of  a  metric  space. 

A correspondence between the labeled graphs G and
H is a graph isomorphism f of G into H. Let the nodes
of G be $P_1, \ldots, P_n$, let the label of node P be denoted
by L(P), and let the distance between two labels L and
M be denoted by d(L,M). Then any correspondence f
gives rise to a distance vector

$\Delta_f = \bigl( d(L(P_1), L(f(P_1))), \ldots, d(L(P_n), L(f(P_n))) \bigr)$.

In general, the degree of mismatch associated with
the correspondence f can be defined as $\|\Delta_f\|$, where
$\|\cdot\|$ is a real-valued norm. A t-match between G and H
is a correspondence f such that $\|\Delta_f\| < t$. Our general
quantitative matching problem is to find all t-matches
between G and H, for a given value of t.

We shall assume here that $\|\cdot\|$ is “monotonic” in the
sense that, for any vector $\Delta = (x_1, \ldots, x_m)$ of nonnegative
real numbers, we have $\|(x_1, \ldots, x_k)\| \le \|(x_1, \ldots, x_m)\|$
for all $1 \le k \le m$. [This is true for many standard
norms, e.g., for $\sup_{1 \le i \le m} x_i$, $\sum_{i=1}^{m} x_i^2$, and so on.]
This assumption allows us to reduce the computational cost
of finding all the t-matches. In fact, for a given
correspondence, if we have already computed
$d(L(P_i), L(f(P_i)))$ for a subset of the i's, and find that
the norm of this tuple of d's already exceeds t, we need not
compute the remaining d's, since $\|\Delta_f\|$ will surely exceed t.
For concreteness, we shall assume from now on that

$\|\Delta\| = \|(x_1, \ldots, x_m)\| = \sum_{i=1}^{m} x_i$.
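
As an illustration of the early termination that monotonicity permits, the following is a minimal Python sketch under the sum norm adopted above. The function names `bounded_mismatch` and `is_t_match` are hypothetical, and the labels are assumed to be real numbers with d(L, M) = |L - M|; none of this is taken from the original text.

```python
def bounded_mismatch(g_labels, h_labels, t):
    """Accumulate sum of |a - b| over corresponding labels.

    Stops as soon as the partial sum reaches the threshold t, since a
    monotonic norm can only grow as further terms are added.
    Returns (partial sum, number of distances actually computed).
    """
    total = 0.0
    computed = 0
    for a, b in zip(g_labels, h_labels):
        total += abs(a - b)
        computed += 1
        if total >= t:
            break  # the remaining distances cannot bring the sum back below t
    return total, computed


def is_t_match(g_labels, h_labels, t):
    """A correspondence is a t-match when the full sum stays below t."""
    total, _ = bounded_mismatch(g_labels, h_labels, t)
    return total < t
```

In this sketch the order of comparison is simply the order of the input sequences; the remainder of the paper concerns how to choose that order.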

If we have a probabilistic model for the d's, we can
further reduce the expected computational cost of finding
all the t-matches by computing the d's in an appropriate
order. For example, suppose that for each L we know
the expected value of d(L,L') (averaged over the set of
all L'). If we compute the d's in order of their expected
size (i.e., first compute $d(L(P_i), L(f(P_i)))$ for those
$L(P_i)$'s for which the expected value is greatest, then for
those for which it is next greatest, and so on), then we
can expect that the value of $\sum_i d(L(P_i), L(f(P_i)))$ will
increase rapidly, so that it should only be necessary to
compute a few d's before t is exceeded. In the next section
we shall establish some general results about optimal
ordering schemes.

3.  OPTIMAL  ORDERING  IN  MATCHING 

As  already  indicated,  if  we  know  the  expected  value 
of  d(L,L')  for  each  L ,  we  should  be  able  to  use  this 
information  to  determine  the  order  in  which  to  compute 
the  d's,  and  thereby  reduce  the  expected  number  of  d’s 
that  need  to  be  computed  before  the  threshold  t  is 
exceeded.  In  particular,  we  suggested  that  the  d's  should 
be  computed  in  decreasing  (more  precisely,  nonincreasing) 
order  of  their  expected  size.  In  fact,  this  is  not  always 
the  optimal  order,  i.e.,  this  order  does  not  necessarily 
minimize  the  expected  number  of  d’s  that  need  to  be 
computed.  In  this  section  we  present  some  algorithms  for 
determining  the  optimal  order  under  various  assumptions. 

We  first  give  a  simple  example  to  show  that  the 
decreasing  order  is  not  always  optimal.  Let  the  labels  be 
the  integers  0,  1,  and  2,  with  probabilities  L,  (,  and  «, 
respectively,  where  e  is  a  small  positive  number.  Then 
the  possible  values  of  d(L,L’)  are  0,  1,  and  2,  and  their 
probability  densities  and  expected  values  are  as  follows: 

[Table: for each label L = 0, 1, 2, the probability that d(L,L') = 0, 1, 2,
and the expected value of d(L,L').]


Thus the expected d value is greatest for L = 0, next
greatest for L = 1, and least for L = 2. On the other
hand, suppose we have already computed a set of d's and
have accumulated a value of $\sum d_i$ that is between t-1
and t. If we next choose a node with label L = 0, to
maximize the expected value of d, then with probability
equal to that of label 0 we will get d = 0, so that the value
of $\sum d_i$ will still be less than t. On the other hand, if we
choose a node with label L = 1, then with probability
arbitrarily close to 1 we will get d = 1, so that $\sum d_i$ will
be at least t and we can stop.
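
The effect can be checked numerically. The sketch below is illustrative only: the label probabilities (1/2 - 2ε, ε, 1/2 + ε) are hypothetical values chosen to reproduce the qualitative behavior described above, not the values used in the original example, and the function names are assumptions.

```python
from fractions import Fraction


def distance_distribution(label, probs):
    """Return {d: Prob(|label - L'| = d)} when L' is drawn according to probs."""
    dist = {}
    for other, p in probs.items():
        d = abs(label - other)
        dist[d] = dist.get(d, Fraction(0)) + p
    return dist


def expected_distance(label, probs):
    """Expected value of d(label, L')."""
    return sum(d * p for d, p in distance_distribution(label, probs).items())


def prob_at_least(label, probs, gap):
    """Probability that d(label, L') is at least `gap`."""
    return sum(p for d, p in distance_distribution(label, probs).items() if d >= gap)


eps = Fraction(1, 100)
# Hypothetical label probabilities for labels 0, 1, 2 (an assumption).
probs = {0: Fraction(1, 2) - 2 * eps, 1: eps, 2: Fraction(1, 2) + eps}

for label in (0, 1, 2):
    print(label, expected_distance(label, probs), prob_at_least(label, probs, 1))
```

With these assumed values, label 0 has the largest expected distance, yet label 1 is far more likely to contribute the single unit needed to reach the threshold when the accumulated sum lies between t-1 and t.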

3.1.  The  optimal  order 

The problem of determining the optimal order can
be formulated in general terms as follows: Suppose we
have a set $I_n$ of n possible operations, denoted by
$1, \ldots, n$, each of which produces a nonnegative real
value, where the ith operation has cost $c_i$ and its value is
the random variable $e_i$. (In our case, the operations are
the computation of the d's, i.e., $e_i = d_i$, and they all
have the same cost.) Our task is to order the operations
so as to achieve $\sum e_i \ge t$ at minimum cost $\sum c_i$.

A state of the process can be fully characterized by
the accumulated value and the set of unused operations.
Let the set of all possible states be denoted by S.
Clearly $S = \bigcup_{k=1}^{n} S_k$, where $S_k$ is the set of possible
states at the kth step of the process. It is easy to see that

$S_k = \{0, \ldots, E_k\} \times \{\text{all possible subsets of } I_n \text{ of size } n-k+1\}$

where $E_k$ is the maximum possible accumulated error at
the kth step. Let D denote the set of all possible decisions.
Thus $D = I_n \cup \{h\}$, where h is the decision to
halt the process and i, $1 \le i \le n$, is the decision to do
operation i. Also let $E[\sigma]$ denote the accumulated error
and $A[\sigma]$ the set of available operations corresponding to
state $\sigma$.

Every state $\sigma \in S$ has the following two decision
table entries:

$d[\sigma] \in D$, the optimal decision at state $\sigma$.
$c[\sigma]$, the expected cost of a process starting from
state $\sigma$ and making optimal decisions henceforth,
i.e., the minimum possible expected cost
starting from $\sigma$.

We now give a backward induction method for
building the optimal decision table, i.e., determining the
values of $d[\sigma]$. It is easy to fill in the values of $d[\sigma]$ and
$c[\sigma]$ for $\sigma \in S_n$. We then work backwards to fill in the
values for the other states.

Basis:

For all $\sigma \in S_n$ do

[Note that there is only one operation (say i) available.]
if $E[\sigma] \ge t$ then $d[\sigma] = h$ and $c[\sigma] = 0$
else $d[\sigma] = i$ and $c[\sigma] = c_i$

Induction:

[Assuming that we are given $d[\sigma]$ and $c[\sigma]$ for all states
in $S_{k+1}$, we compute them for all states in $S_k$.]

For all $\sigma \in S_k$ do

if $E[\sigma] \ge t$ then $d[\sigma] = h$ and $c[\sigma] = 0$
else

$c[\sigma] = \min_{i \in A[\sigma]} \Bigl( \sum_{\sigma' \in S_{i\sigma}} p_{i\sigma}(\sigma')\, c[\sigma'] + c_i \Bigr)$

$d[\sigma]$ = the i which minimizes the above

[where $S_{i\sigma}$ is the set of states reachable from $\sigma$ after
applying operation i and $p_{i\sigma}(\sigma')$ is the probability of
reaching $\sigma'$ from $\sigma$ using operation i. Note that
$S_{i\sigma} \subset S_{k+1}$ and also that for all $\sigma' \in S_{i\sigma}$,
$p_{i\sigma}(\sigma') = \mathrm{Prob}(e_i = E[\sigma'] - E[\sigma])$.]

Theorem 3.1. The decision table constructed in this
way is optimal.

Proof: We prove optimality inductively. Clearly, for
$\sigma \in S_n$, the decision $d[\sigma]$ is optimal and $c[\sigma]$ corresponds
to the minimum expected cost. Now assume we know
the optimal decisions and the corresponding minimum
expected costs for all states in $S_j$, $k+1 \le j \le n$. For
every $\sigma \in S_k$ we pick the decision which gives the least
expected cost. Thus the decision for the current state $\sigma$
is optimal. ■
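
The backward induction can be sketched as a memoized recursion over states (accumulated value, set of unused operations). This is a minimal illustration, not the authors' implementation: it assumes independent operations with finite, nonnegative integer-valued distributions given as dictionaries `{value: probability}`, and the name `build_decision_table` is hypothetical.

```python
from functools import lru_cache


def build_decision_table(dists, costs, t):
    """Return solve(acc, remaining) -> (minimum expected cost, decision).

    dists[i] maps each possible value of operation i to its probability;
    costs[i] is the cost of operation i. A decision of None means "halt".
    """

    @lru_cache(maxsize=None)
    def solve(acc, remaining):
        # Halt when the threshold is reached or nothing is left to do.
        if acc >= t or not remaining:
            return 0.0, None
        best_cost, best_op = None, None
        for i in sorted(remaining):
            rest = remaining - {i}
            expected = costs[i]
            for value, p in dists[i].items():
                # All accumulated values >= t are equivalent, so cap at t.
                expected += p * solve(min(acc + value, t), rest)[0]
            if best_cost is None or expected < best_cost:
                best_cost, best_op = expected, i
        return best_cost, best_op

    return solve


# Two binary-valued operations with unit cost and threshold t = 1.
ops = [{0: 0.5, 1: 0.5}, {0: 0.1, 1: 0.9}]
solve = build_decision_table(ops, [1.0, 1.0], 1)
cost, first = solve(0, frozenset({0, 1}))
```

The recursion visits each reachable state once, which mirrors filling the table from $S_n$ backwards; the number of states still grows exponentially with the number of operations.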

3.2.  The  optimal  static  order 

In general, the optimal decision at each stage will
depend on the outcomes of the previous operations. Thus
the above decision algorithm is an adaptive decision strategy.
We now consider the subclass of “static” decision
strategies, in which the operations are done in a fixed
order irrespective of the outcomes of the previously done
operations. The optimal static decision strategy is the
best one can do in applications where the results of
individual operations are not observable.

A static decision strategy for a set of operations can
be represented by the ordering that it induces on the set.
Let $\pi$ be an ordering of the elements of $I_n$. For simplicity
we will assume from now on that all operations have unit
cost.

Theorem 3.2. The expected cost of the static decision
strategy defined by $\pi$ is

$C[\pi] = 1 + \sum_{j=2}^{n} \mathrm{Prob}\Bigl( \sum_{k=1}^{j-1} e_{\pi(k)} < t \Bigr)$

Proof: Let N be the random variable denoting the
number of operations to be performed before halting.
Then $C[\pi]$ is given by

$\sum_{j=1}^{n} j\, \mathrm{Prob}(N = j) = \sum_{j=1}^{n} \mathrm{Prob}(N \ge j) = 1 + \sum_{j=2}^{n} \mathrm{Prob}\Bigl( \sum_{k=1}^{j-1} e_{\pi(k)} < t \Bigr)$

by definition of N. ■
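
The formula of Theorem 3.2 can be evaluated directly by convolving the value distributions along the ordering. The following is a minimal sketch assuming independent operations with finite discrete distributions given as dictionaries `{value: probability}`, unit costs, and t > 0; the names `convolve` and `static_order_cost` are hypothetical.

```python
def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete random variables."""
    out = {}
    for x, p in dist_a.items():
        for y, q in dist_b.items():
            out[x + y] = out.get(x + y, 0.0) + p * q
    return out


def static_order_cost(dists, order, t):
    """Expected number of operations for the fixed ordering `order`.

    Implements C[pi] = 1 + sum over j = 2..n of
    Prob(e_pi(1) + ... + e_pi(j-1) < t).
    """
    cost = 1.0
    partial = {0: 1.0}  # distribution of the empty sum
    for op in order[:-1]:
        partial = convolve(partial, dists[op])
        cost += sum(p for value, p in partial.items() if value < t)
    return cost


ops = [{0: 0.5, 1: 0.5}, {0: 0.1, 1: 0.9}]
print(static_order_cost(ops, [1, 0], 1))  # operation 1 first
print(static_order_cost(ops, [0, 1], 1))  # operation 0 first
```

In this small example, doing the operation that is more likely to produce a nonzero value first gives the lower expected cost.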


Corollary 3.3. Let $\pi^*$ be the optimal ordering of $I_n$.
Then the ordering $\pi^{*\prime}$ such that $\pi^{*\prime}(j) = \pi^*(j)$ for all
$1 \le j \le n-1$ is the optimal ordering of $I_n - \{\pi^*(n)\}$.

Proof: This follows from the additive form of the
expected cost derived in Theorem 3.2. ■

The general idea of our method of determining the
best static decision strategy is to compute the optimal
ordering for small subsets of $I_n$ and work up inductively
to larger subsets. For any subset J of $I_n$, let $\pi[J]$ be the
optimal ordering of J and $c[J]$ its cost. Let the probability
density of $e_j$ be $f_j$; that of $\sum_{j \in J} e_j$ be $g_J$; and let the
probability distribution of $\sum_{j \in J} e_j$ be $G_J$, so that

$G_J(x) = \sum_{y=0}^{x-1} g_J(y)$.

Basis:

For all sets $J \subset I_n$ having just one operation (say k)
do

[Compute the density function.]

$g_J = f_k$

[Compute the orderings and the costs.]

$\pi[J](1) = k$ and $c[J] = 1$

Induction:

[We assume that we have the optimal orderings and
costs for all subsets of size j and we compute them
for those of size j + 1.]

For all sets $J \subset I_n$ of cardinality j + 1 do

[Compute the distribution function.]

Let k' be any element of J.

$g_J = g_{J - \{k'\}} \otimes f_{k'}$

where $\otimes$ is the convolution operator.

[Compute the orderings and the costs.]

$c[J] = \min_{k \in J} \bigl( c[J - \{k\}] + G_{J - \{k\}}(t) \bigr)$
$\pi[J](j') = \pi[J - \{k''\}](j')$ for $1 \le j' \le j$
and $\pi[J](j+1) = k''$

[where k'' is the element which minimizes the
previous expression.]
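
The subset induction above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions (independent operations, finite discrete distributions as dictionaries `{value: probability}`, unit costs, at least one operation); the name `optimal_static_order` and the data layout are hypothetical.

```python
from itertools import combinations


def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete random variables."""
    out = {}
    for x, p in dist_a.items():
        for y, q in dist_b.items():
            out[x + y] = out.get(x + y, 0.0) + p * q
    return out


def optimal_static_order(dists, t):
    """Return (ordering, expected cost) of the best fixed ordering."""
    n = len(dists)
    density, cost, order = {}, {}, {}
    # Basis: singleton subsets.
    for k in range(n):
        J = frozenset([k])
        density[J] = dict(dists[k])
        cost[J] = 1.0
        order[J] = (k,)
    # Induction: build subsets of size 2, 3, ..., n from smaller ones.
    for size in range(2, n + 1):
        for members in combinations(range(n), size):
            J = frozenset(members)
            k_any = members[0]
            density[J] = convolve(density[J - {k_any}], dists[k_any])
            best = None
            for k in members:  # k is the candidate last operation
                rest = J - {k}
                # G_rest(t): probability that the sum over rest is below t.
                below = sum(p for v, p in density[rest].items() if v < t)
                candidate = cost[rest] + below
                if best is None or candidate < best[0]:
                    best = (candidate, k)
            cost[J] = best[0]
            order[J] = order[J - {best[1]}] + (best[1],)
    full = frozenset(range(n))
    return order[full], cost[full]


ops = [{0: 0.5, 1: 0.5}, {0: 0.1, 1: 0.9}, {0: 0.7, 2: 0.3}]
best_order, best_cost = optimal_static_order(ops, 2)
```

The tables are indexed by subsets, so memory as well as time grows exponentially with n; the sketch is only practical for small sets of operations.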

Theorem 3.4. The above algorithm computes the
optimal static decision strategy in $O(n 2^n)$ steps.

Proof: When the set J has just one element the ordering
$\pi[J]$ is trivially optimal and the optimal cost is 1. Now
assume that we have the optimal orderings and the
corresponding costs for sets of size j. Consider a set J of
cardinality j + 1. The last element of the optimal ordering
has to be one of the j + 1 elements in J. By Corollary
3.3 we know that the ordering of the remaining elements
must also be optimal. Thus we have only j + 1 possible
orderings as candidates for the optimal ordering for J. Of
these we choose the one with the least cost as $\pi[J]$. Then
evidently $\pi[J]$ is optimal. Note that evaluating the costs
involves determining the new probability density function,
which can be obtained by convolving the probability
density of one of the elements of J with the density function
of the sum of the remaining elements, which is obtained
from the previous inductive step.


If J has j elements, the number of steps for computing
the minimum and determining the error (including the
convolution) is proportional to j. Also the number of
sets of size j is $\binom{n}{j}$. Thus the total number of steps is
of the form

$\sum_{j=1}^{n} K j \binom{n}{j} = K n 2^{n-1} = K' n 2^n$

where K and K' are constants. ■

Note that $O(n 2^n)$ is a considerable saving over the brute
force search algorithm which involves more than n! steps.
However, the algorithm still takes an exponential number
of steps. We may not be able to do better as this problem
is likely to be NP-hard.


3.3. The decreasing order

The set of operations $I_n$ will be called orderable if
there is an ordering $\pi$ such that for all $1 \le j < k \le n$
and for all t', $\mathrm{Prob}(e_{\pi(j)} \ge t') \ge \mathrm{Prob}(e_{\pi(k)} \ge t')$. In this
section we prove that if this condition holds, then
irrespective of the current accumulated error the choice
of operation $\pi(j)$ is always better than the choice of $\pi(k)$,
i.e., the ordering $\pi$ is optimal.

Note that there are natural applications where the
orderability criterion is satisfied. In many application
domains either there is an error or no error, i.e., it may
not be possible to grade the errors quantitatively. In
these cases the random variables $e_i$ take binary values and
it can be seen that any such set of operations is orderable.
In fact in this case the order is simply the order of
decreasing expected error.


Theorem 3.5. Let the set of operations $I_n$ be orderable
and $\pi$ be the corresponding order. Then the static ordering
$\pi$ is optimal even among adaptive strategies.

Proof: Our aim will be to show by induction that the
optimum decision table of Section 3.1 follows this static
ordering. Without loss of generality we can assume that
the ordering $\pi$ is the ascending numerical order, i.e.,
$\pi(j) = j$ for all $1 \le j \le n$. The theorem can then be
restated as follows:

T: For all states $\sigma \in S$, if $E[\sigma] < t$ then $d[\sigma]$ = the
smallest element in $A[\sigma]$, else $d[\sigma] = h$.

Let us denote the optimal cost at $\sigma$ by $c[E[\sigma], A[\sigma]]$. It
is clear that for all $\sigma$ such that $E[\sigma] \ge t$, $d[\sigma]$ is
assigned h by the algorithm of Section 3.1. Therefore we
need only consider $\sigma$'s such that $E[\sigma] < t$. As before, let
$S_i$ be the set of states having n - i + 1 available operations.
We have the following cases:

a) If $\sigma \in S_n$, $A[\sigma]$ has only one element. It is easy to
see that the basis part of the algorithm assigns this
element to $d[\sigma]$. Thus T is trivially true.

b) If $\sigma \in S_{n-1}$, let $A[\sigma] = \{i, j\}$ and $i < j$. Let
$E[\sigma] = t'$ where $t' < t$. Let $c_i$ represent the
minimum expected cost given that choice i is made
at $\sigma$, and similarly for $c_j$. These are the costs computed
in the minimization step of the algorithm.
Then we have

$c_i = 1 + \mathrm{Prob}(e_i < t - t')$

This is clear because the process stops only if the
threshold t is exceeded. Similarly

$c_j = 1 + \mathrm{Prob}(e_j < t - t')$

From the orderability condition and $i < j$ we have
$c_i \le c_j$. Thus i is an optimal choice at $\sigma$, and T is
true.

c) If $\sigma \in S_j$, $j < n-1$, we proceed as follows: We
assume that T is true for all states in $S_{j'}$,
$j+1 \le j' \le n$, and show that T holds for $\sigma \in S_j$.
Let $E[\sigma] = t'$, $t' < t$. Let i be the smallest element
in $A[\sigma]$ and let j be any other element. We will
show that $c_i \le c_j$ where $c_i$ and $c_j$ are as defined
earlier. For convenience we write $A[\sigma]$ as $\{i, j, o\}$
where o represents the remaining elements of the
set, and we denote $t - t' - 1$ by r. Then

$c_i = 1 + \sum_{x=0}^{r} c[t'+x, \{j,o\}]\, f_i(x)$

$\le 1 + \sum_{x=0}^{r} \sum_{x'=0}^{r-x} c[t'+x+x', \{o\}]\, f_j(x')\, f_i(x) + \mathrm{Prob}(e_i \le r)$

since $c[t'+x, \{j,o\}] \le 1 + \sum_{x'=0}^{r-x} c[t'+x+x', \{o\}]\, f_j(x')$.

This inequality follows from the fact that
$c[t'+x, \{j,o\}]$ is the optimal cost. Similarly

$c_j = 1 + \sum_{x=0}^{r} c[t'+x, \{i,o\}]\, f_j(x)$

$= 1 + \sum_{x=0}^{r} \sum_{x'=0}^{r-x} c[t'+x+x', \{o\}]\, f_i(x')\, f_j(x) + \mathrm{Prob}(e_j \le r)$

Note that the equality in the last equation is due to
our induction assumption that T holds for the states
in $S_{j+1}$. We know that i, which is the smallest element
in $\{i,o\}$, is the optimal choice and hence the
equality. Observe that the summations in the last
two equations are equal. Thus from these equations
and the orderability condition we have $c_i \le c_j$ for
any j in $A[\sigma]$. Thus i is an optimal choice at state
$\sigma$, so that T holds for $\sigma$. ■

Corollary 3.6. If $e_i$, $1 \le i \le n$, are binary-valued then
the order of decreasing expected error is optimal.
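
Whether a given set of operations is orderable can be tested mechanically. The sketch below is an illustration under assumptions not in the original: each operation is a finite discrete distribution given as a dictionary `{value: probability}`, and the names `tail_probability`, `dominates`, and `orderable_order` are hypothetical. It sorts by decreasing expected value and then checks the tail condition between consecutive operations, which suffices because the condition is transitive.

```python
def tail_probability(dist, threshold):
    """Prob(e >= threshold) for a discrete distribution {value: probability}."""
    return sum(p for v, p in dist.items() if v >= threshold)


def dominates(dist_a, dist_b, tol=1e-12):
    """True if Prob(a >= s) >= Prob(b >= s) for every threshold s.

    The tail probabilities only change at support points, so it is enough
    to test the values that either distribution can take.
    """
    thresholds = set(dist_a) | set(dist_b)
    return all(
        tail_probability(dist_a, s) >= tail_probability(dist_b, s) - tol
        for s in thresholds
    )


def orderable_order(dists):
    """Return an ordering satisfying the orderability condition, or None."""
    def mean(d):
        return sum(v * p for v, p in d.items())

    order = sorted(range(len(dists)), key=lambda i: mean(dists[i]), reverse=True)
    for a, b in zip(order, order[1:]):
        if not dominates(dists[a], dists[b]):
            return None
    return order


binary_ops = [{0: 0.8, 1: 0.2}, {0: 0.1, 1: 0.9}, {0: 0.5, 1: 0.5}]
print(orderable_order(binary_ops))
```

For binary-valued operations the check always succeeds and the returned ordering is the order of decreasing probability of error, in line with the corollary.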

3.4.  Application  to  quantitative  matching 

To apply the results of this section to the quantitative
matching of labeled graphs, we need to treat the distance
values d(·) [between the labels of corresponding
nodes] as random variables. We can do this by considering
the set of all possible correspondences between G and
H, so that a given node of G, say with label L, can
correspond to any (randomly chosen) node of H, say with
label L', and d(L,L') becomes a random variable. If we
now use an optimal matching order to minimize the
expected number of steps needed to exceed the threshold
t, then when we try all the correspondences of G with H
(in order to find all the t-matches), we will eliminate the
mismatches quickly, and thus minimize the expected
computational cost of the entire process.

As an important example, suppose that the node
labels are real numbers, and that the labels on the nodes
of H are randomly assigned with a given probability density.
Then the distance between a given label L (on a
node of G) and an arbitrary label L' (on a node of H)
becomes a random variable, and its probability density
can be immediately computed, since for any d > 0 we have
Prob(d) = Prob(L' = (L + d)) + Prob(L' = (L - d)),
while Prob(0) = Prob(L' = L).
Thus the results of this section allow us to define optimal
orderings for the quantitative matching of graphs having
random (numerical) node labels.

If G and H are digital images, the node labels are
just the pixel gray levels. Thus when an ensemble of
images is modeled as a random field, i.e., when we specify
the probability density of the gray levels, an image from
that ensemble is a special case of a graph having random
node labels. In the next section we show how ordered
matching reduces the expected computational cost of
matching images when we know their gray level
probabilities.


4.  ORDERED  IMAGE  MATCHING 

In this section we assume that G and H are digital
images, where G (the “template”) is m × m and H (the
“image”) is n × n. Let the gray levels of the pixels in G
be denoted by $z_i$ ($1 \le i \le m^2$), and for a given
correspondence of G with H, let the gray levels of the
corresponding pixels of H be denoted by $w_i$
($1 \le i \le m^2$). Thus the match error for the given
correspondence is $\sum_{i=1}^{m^2} |z_i - w_i|$. We want to order the
pixels of G so that, when we compute the terms of the sum
in that order, we minimize the expected number of terms
that need to be computed before the threshold t is
exceeded.


In this section we will primarily be interested in
computing the terms in decreasing order of expected
difference. Let the image H come from an ensemble in
which the probability of gray level w is p(w). [We can
estimate the p's by computing the histogram of H; if H
is n × n, and there are h(w) pixels with gray level w,
then $h(w)/n^2$ is an estimate of p(w).] Then the probability
of obtaining difference d > 0 for a template pixel with
gray level z is q(d|z) = p(z - d) + p(z + d). The
expected difference is thus $e(z) = \sum_d d\, q(d|z)$. Our
strategy is to first compute $|z_i - w_i|$ for those z's for which
e(z) is greatest, then for those z's for which it is next
greatest, and so on, until t is exceeded or until all $m^2$
differences have been computed.
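
The strategy just described can be sketched in Python as follows. This is an illustration only: images are assumed to be flat lists of integer gray levels in the range 0 to `levels - 1`, all function names are hypothetical, and e(z) is computed in the equivalent form $\sum_w p(w)\,|z - w|$, which equals $\sum_d d\, q(d|z)$.

```python
from collections import Counter


def gray_level_probabilities(image_pixels, levels):
    """Estimate p(w) from the histogram of the image."""
    counts = Counter(image_pixels)
    n = len(image_pixels)
    return [counts.get(w, 0) / n for w in range(levels)]


def expected_difference(p):
    """e(z) = sum over w of p(w) * |z - w|, for every gray level z."""
    levels = len(p)
    return [sum(p[w] * abs(z - w) for w in range(levels)) for z in range(levels)]


def decreasing_order(template_pixels, e):
    """Template pixel indices sorted by decreasing expected difference."""
    return sorted(range(len(template_pixels)),
                  key=lambda i: e[template_pixels[i]], reverse=True)


def ordered_mismatch(template_pixels, window_pixels, order, t):
    """Sum |z_i - w_i| in the given order, stopping once t is reached."""
    total, computed = 0, 0
    for i in order:
        total += abs(template_pixels[i] - window_pixels[i])
        computed += 1
        if total >= t:
            break
    return total, computed


image = [0, 0, 0, 0]        # toy image: every pixel has gray level 0
template = [1, 3, 0, 2]
p = gray_level_probabilities(image, levels=4)
e = expected_difference(p)
order = decreasing_order(template, e)
```

The ordering depends only on the template and on p(w), so it can be computed once and reused for every position of the template in the image.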

[As we saw in Section 3, this strategy does not
always minimize the expected computational cost (= the
number of differences that need to be computed before t
is exceeded), though it evidently does maximize the
expected value of $\sum |z_i - w_i|$ for any given number of
terms of the sum. We also saw in Section 3 that if the
images are binary, the descending order does minimize
expected computational cost. Note that in the binary
case, it is easy to compute e(z) explicitly. In fact, let the
probabilities of 1 and 0 be p and 1 - p. The only possible
values of d are 0 and 1, and evidently we have
q(1|1) = p(0) = 1 - p, q(0|1) = p(1) = p, q(1|0) = p(1)
= p, and q(0|0) = p(0) = 1 - p. Thus e(1) = 1 - p and
e(0) = p, so that the descending order simply means that
if p > 1/2 we first compute $|z_i - w_i|$ for all the 0's in the
template, while if p < 1/2 we first compute it for all the
1's.]

The  decreasing  order  may  or  may  not  be  optimal, 
but  even  if  it  is,  the  theory  of  Section  3  does  not  tell  us 
how  much  better  it  is  than  (say)  random  order.  In  the 
remainder  of  this  section  we  present  experimental  results 
that  show  the  improvement  obtained  when  the  decreasing 
order  is  used. 

We should first point out that there are cases in
which using the decreasing order yields no advantage; in
fact, there are cases where e(z) is constant for all z, so
that no order yields a greater expected rate of increase
than any other. Specifically, let the gray level range be
[0, M], and let p(w) = 1/2 for w = 0 or M, and p(w) = 0
otherwise. Then evidently for any z we have |z - w| = z with
probability 1/2 and M - z with probability 1/2, so the
expected value of |z - w| is e(z) = z/2 + (M - z)/2 = M/2
for all z. At the other extreme, if p(w) = 1 for w = 0 and
p(w) = 0 otherwise, then for any z we have |z - w| = z
with probability 1, so the expected value of |z - w| is z,
which can range anywhere between 0 and M (and similarly
if p(w) = 1 for w = M). [On the other hand, if,
e.g., p(w) = 1 for w = M/2, then e(z) can range between 0
and M/2; and similarly for other cases in which p(w) = 1
for a specific value of w.] As a final example, let p(w) be
constant for all w; then evidently the expected value of
|z - w| is M/2 when z = 0 or M, and M/4 when z = M/2, while
it takes on intermediate values for other z's, so that it
ranges between M/4 and M/2. These examples suggest that
using decreasing order is most advantageous when p(w)
is sharply peaked at a single value; also somewhat
advantageous when p(w) is uniform; and not advantageous
when p(w) has two equal, widely separated peaks.
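
These three cases can be reproduced numerically. The sketch below is illustrative: it assumes integer gray levels 0, ..., M with M = 15, and the function name `expected_difference` is hypothetical. With discrete levels the minimum of e(z) for uniform p(w) is close to, but not exactly, M/4.

```python
def expected_difference(p):
    """e(z) = sum over w of p(w) * |z - w|, for every gray level z."""
    levels = len(p)
    return [sum(p[w] * abs(z - w) for w in range(levels)) for z in range(levels)]


M = 15  # assumed top gray level, giving 16 levels in all

# Two equal, widely separated peaks: p(0) = p(M) = 1/2.
two_peaks = [0.5 if w in (0, M) else 0.0 for w in range(M + 1)]
# A single sharp peak at w = 0.
one_peak = [1.0 if w == 0 else 0.0 for w in range(M + 1)]
# Uniform density over all gray levels.
uniform = [1.0 / (M + 1)] * (M + 1)

for name, p in (("two peaks", two_peaks), ("one peak", one_peak), ("uniform", uniform)):
    e = expected_difference(p)
    # The spread max(e) - min(e) indicates how much an ordering can matter.
    print(name, min(e), max(e), max(e) - min(e))
```

A spread of zero means every template pixel has the same expected difference, so the ordering cannot help; the larger the spread, the more the decreasing order can gain over an arbitrary order.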

These observations are borne out by the examples
shown in Figures 1-5. In all of these figures both the
image and the template have 16 gray levels, and the size
of the template is 16 × 16. The image and template are
both randomly generated to have (approximately) given
histograms. Each figure shows the advantage of decreasing
order over random order in terms of the fraction of
total mismatch.

In Figures 1 and 2 p(w) is uniform; in Figure 1 the
template  also  has  a  uniform  gray  level  probability  density 
while  in  Figure  2  its  probability  density  is  peaked  at  a 
single  value  near  the  center  of  the  gray  level  range.  In 
both  cases,  ordered  matching  is  advantageous,  but  the 
advantage  is  not  great  even  in  Figure  1,  and  it  is  negligi¬ 
ble  in  Figure  2.  This  is  because  in  the  Figure  1  case  the 
template  has  many  pixels  at  the  ends  of  the  gray  level 
range,  where  the  expected  difference  value  is  M /2,  and 
when  we  do  ordered  matching  we  can  compute  these 
differences  first  to  get  a  relatively  fast  increase  in  the 
mismatch;  but  in  the  Figure  2  case  the  template  pixels 
are  nearly  all  in  the  middle  of  the  gray  level  range,  so 
that  in  both  ordered  and  random  matching  we  have  no 
choice  but  to  use  them.  To  estimate  the  advantage  of 
ordered  over  random  matching  in  the  Figure  1  case,  sup¬ 
pose  t  corresponds  to  40%  of  the  total  mismatch.  Then 
if  we  use  random  order,  we  cannot  expect  to  exceed  t 
until  40%  of  the  pixels  have  been  compared;  but  if  we 
use  decreasing  order,  we  can  expect  to  exceed  it  when 
about  30%  of  the  pixels  have  been  compared.  Thus  even 
in  this  case,  ordered  matching  reduces  the  expected  com¬ 
putational  cost  by  25%. 

As  expected,  the  advantages  of  ordered  matching  are 
greater  in  the  case  shown  in  Figure  3,  where  the  template 
has  a  uniform  gray  level  probability  density,  but  the 
image’s  density  is  peaked  at  the  middle  of  the  range. 
Here, if we use decreasing order, we exceed t = 40%
when  less  than  25%  of  the  pixels  have  been  matched;  but 
if  we  use  random  order,  we  do  not  exceed  it  until  40%  of 
the  pixels  have  been  matched.  Thus  the  cost  saving  in 
this  case  is  nearly  50%. 

The saving is even more impressive in the case shown
in  Figure  4.  Here  the  image  and  template  both  have  den¬ 
sities  that  are  peaked  in  the  middle  of  the  range.  Thus 
when  we  use  random  order,  the  differences  are  low;  but 
when  we  use  decreasing  order,  we  first  compute  the 
differences  for  those  template  pixels  that  are  near  the 
ends  of  the  range,  so  that  we  rapidly  accumulate  high 
differences.  In  this  example,  when  we  use  decreasing 
order  we  exceed  t  =  40%  when  only  about  17%  of  the 
pixels  have  been  matched,  but  when  we  use  random  order 
we  need  to  match  over  40%  of  the  pixels,  so  that  the  cost 
saving  is  about  60%.  Note  that,  as  we  see  in  Figure  5, 
when  the  image  and  template  both  have  peaked  densities 
but  the  peaks  are  far  apart,  the  advantage  of  using 
decreasing  order  is  not  great;  here  the  random  order  also 
leads  to  a  rapid  accumulation  of  mismatch,  since  the 
image  and  template  gray  levels  tend  to  be  far  apart. 

These  results  illustrate  the  computational  savings 
that  can  be  obtained  by  taking  advantage  of  knowledge 
about  the  gray  level  probabilities  in  the  image  (and  tem¬ 
plate).  In  the  next  section  we  will  show  how  knowledge 
about  the  spatial  arrangement  of  the  gray  levels  (in  par¬ 
ticular,  about  probability  densities  of  local  property 
values) can be used to yield further improvements.


5.  LOCAL  PROPERTY  MATCHING 

As  we  saw  in  Section  4,  for  some  types  of  gray  level 
probability  densities,  there  is  little  or  no  advantage  to 
using  ordered  matching.  In  particular,  if  the  image  con¬ 
tains  primarily  low  and  high  gray  levels,  e  (z )  is  approxi¬ 
mately  constant  for  all  z.  We  can  sometimes  improve 
this  situation  by  computing  a  suitably  chosen  local  pro¬ 
perty  at  each  pixel  of  the  template  and  of  the  image. 
This  converts  the  image  into  a  local  property  array  which 
may  have  a  more  nearly  unimodal  probability  density  of 
values.  We  can  then  replace  the  original  matching  prob¬ 
lem  by  the  problem  of  matching  the  local  property  arrays 
derived  from  the  template  and  the  image,  and  ordered 
matching  can  be  advantageously  used  for  this  new  prob¬ 
lem. 

We  can  obtain  a  local  property  array  that  is  more 
nearly  unimodal  than  the  original  image  by  taking  advan¬ 
tage  of  local  gray  level  dependencies  in  the  image.  In 
other  words,  if  we  know  the  second  order  gray  level  pro¬ 
babilities  for  the  image  (i.e.,  the  joint  probabilities  of 
given  pairs  of  gray  levels  occurring  in  given  relative  posi¬ 
tions),  we  can  compute  functions  of  the  gray  levels — in 
particular,  local  properties — that  can  be  expected  to 
have,  say,  high  values.  In  this  section  we  give  several 
illustrations  of  how  this  approach  can  be  used  to  generate 
derived  arrays  in  which  ordered  matching  is  more  advan¬ 
tageous  than  it  was  for  the  original  images. 

It  should  be  realized,  of  course,  that  finding  good 
matches  to  a  local  property  array  is  not  equivalent  to 
finding  good  matches  to  the  original  template.  If  a  given 
correspondence  yields  a  good  match  between  the  template 
and  the  image,  it  also  yields  a  (relatively)  good  match 
between  the  local  property  arrays  derived  from  them. 
For example, suppose the local property is of the form
$f'(x,y) = \sum_{i=1}^{k} c_i f(u_i,v_i)$, where the $f$'s are gray levels and
the $(u_i,v_i)$ are neighbors of $(x,y)$. If $|g(u,v) - h(u,v)| < a$
for all $(u,v)$, where $g$ and $h$ denote the pixel gray levels
in $G$ and $H$, respectively, then

$$|g'(x,y) - h'(x,y)| = \Bigl|\sum_{i=1}^{k} c_i \bigl(g(u_i,v_i) - h(u_i,v_i)\bigr)\Bigr| \le \sum_{i=1}^{k} |c_i|\,|g(u_i,v_i) - h(u_i,v_i)| < a \sum_{i=1}^{k} |c_i|.$$

On the other hand, a good match between the

local  property  arrays  does  not  imply  a  good  match 
between  the  template  and  the  image.  [As  a  very  simple 
example,  let  the  local  property  be  defined  by 
$f'(x,y) = f(x+1,y) - f(x,y)$, and suppose that
$g(u,v) = h(u,v) + c$ for all $u,v$; then $g'(x,y) = h'(x,y)$ for
all $x,y$ (since the $c$ cancels), so the local property arrays
match perfectly, whereas the original image and template
have a mismatch of $\sum_{u,v} (g(u,v) - h(u,v)) = m^2 c$.]
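The constant-offset example can be checked numerically. The sketch below is only an illustration: the array size `m`, the offset `c` and the test pattern are hypothetical values chosen here, and images are represented as plain nested lists indexed as `image[y][x]`.

```python
def horizontal_difference(image):
    """Local property f'(x, y) = f(x+1, y) - f(x, y), computed row by row."""
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in image]


def total_difference(a, b):
    """Sum of a(u, v) - b(u, v) over all positions of two equal-sized arrays."""
    return sum(pa - pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


m, c = 4, 10  # hypothetical template size and gray level offset
h = [[(3 * x + 5 * y) % 7 for x in range(m)] for y in range(m)]  # arbitrary pattern
g = [[value + c for value in row] for row in h]  # g(u, v) = h(u, v) + c

# The derived arrays agree exactly, although g and h differ at every pixel.
same_property_arrays = horizontal_difference(g) == horizontal_difference(h)
offset_mismatch = total_difference(g, h)  # equals m * m * c
```

The derived arrays are identical because the offset cancels in every difference, while the accumulated difference of the original arrays is `m * m * c`.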

It  may  nevertheless  be  advantageous  to  find  good 
matches  between  the  derived  property  arrays  of  the 
image  and  template.  These  matches  are  guaranteed  to 
include  the  good  matches  between  the  image  and  tem¬ 
plate  themselves.  If  good  matches  are  rare  (as  is  usually 
the  case),  it  is  inexpensive  to  test  the  correspondences 
that give good matches between the property arrays to
verify  which  of  these,  if  any,  give  good  matches  between 
the  image  and  template.  At  the  same  time,  by  using 
ordered  matching  on  the  derived  arrays  to  reduce  the 
expected  computational  cost,  we  may  have  greatly 
decreased  the  average  amount  of  computation  that  needs 
to  be  done  at  every  pixel. 
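The two-stage procedure described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in this section: the mismatch measure is taken to be the sum of absolute differences, "ordered matching" is taken to mean comparing template pixels in decreasing order of their expected absolute difference from the image values, and the local property is the horizontal difference. All function names are hypothetical. With these choices each derived value involves two pixels, so an original mismatch of at most `threshold` implies a derived mismatch of at most `2 * threshold`, which is the bound used in the first stage.

```python
from collections import Counter


def horizontal_difference(image):
    """Local property f'(x, y) = f(x+1, y) - f(x, y)."""
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in image]


def ordered_positions(template, image_histogram):
    """Template positions sorted by decreasing expected absolute difference."""
    total = sum(image_histogram.values())

    def expected_difference(z):
        return sum(n * abs(z - w) for w, n in image_histogram.items()) / total

    positions = [(y, x) for y, row in enumerate(template) for x in range(len(row))]
    return sorted(positions,
                  key=lambda p: expected_difference(template[p[0]][p[1]]),
                  reverse=True)


def pixels_until_exceeded(template, image, top, left, order, threshold):
    """Accumulate absolute differences in the given order.

    Returns (exceeded, number_of_pixels_compared); stops at the first pixel
    at which the accumulated mismatch exceeds the threshold.
    """
    mismatch = 0
    for count, (y, x) in enumerate(order, start=1):
        mismatch += abs(template[y][x] - image[top + y][left + x])
        if mismatch > threshold:
            return True, count
    return False, len(order)


def two_stage_match(template, image, threshold):
    """Offsets (top, left) whose total absolute difference is <= threshold."""
    t_prop, i_prop = horizontal_difference(template), horizontal_difference(image)
    histogram = Counter(v for row in i_prop for v in row)
    order = ordered_positions(t_prop, histogram)
    raster = [(y, x) for y, row in enumerate(template) for x in range(len(row))]
    matches = []
    for top in range(len(image) - len(template) + 1):
        for left in range(len(image[0]) - len(template[0]) + 1):
            # Stage 1: cheap rejection on the derived arrays.
            if pixels_until_exceeded(t_prop, i_prop, top, left, order, 2 * threshold)[0]:
                continue
            # Stage 2: verify the surviving candidate on the original arrays.
            if not pixels_until_exceeded(template, image, top, left, raster, threshold)[0]:
                matches.append((top, left))
    return matches
```

Because the first stage can only reject offsets that would also fail the second stage, the result is the same as that of an exhaustive search; the saving comes from the early termination in the first stage.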

The  advantage  of  matching  local  property  arrays  is 
illustrated  in  Figures  6-8.  In  all  of  these  figures,  the  tem¬ 
plate  and  image  both  have  four  gray  levels  with  (approxi¬ 
mately)  equal  probabilities  (i.e.,  their  gray  level  probabil¬ 
ity  densities  are  uniform).  [We  recall  that  in  this  case  the 
advantage  of  using  ordered  matching  was  not  great.]  Fig¬ 
ure  6  shows  two  such  images,  in  one  of  which  the  joint 
probabilities  of  similar  gray  levels  are  high  for  pairs  of 
pixels  that  are  horizontal  neighbors,  while  in  the  other 
they  are  high  for  vertically  neighboring  pairs.  The  local 
property used is $f'(x,y) = f(x+1,y) - f(x,y)$. (Note that
the local property arrays have both positive and negative
values.) The template (of size 16 × 16) is a portion of the
image  in  Figure  6a;  in  Figure  7  we  show  results  when  it  is 
matched  to  another  portion  of  the  same  image.  Here  the 
histograms  of  the  property  value  arrays  both  have  sharp 
peaks  in  the  middle  of  the  value  range  (i.e.,  at  0),  and  as 
expected  from  the  case  of  Figure  4,  the  saving  resulting 
from  using  the  decreasing  order  is  substantial.  On  the 
other  hand,  Figure  8  shows  results  when  the  same  tem¬ 
plate  is  matched  to  a  portion  of  the  image  in  Figure  6b. 
Here  the  histogram  of  the  property  value  array  obtained 
from  the  image  is  much  flatter,  and  as  a  result,  the  sav¬ 
ing  is  not  as  great  (compare  Figure  3). 

A  final  example  is  shown  in  Figures  9-10.  Here  the 
image  and  template  both  consist  of  pure  “salt-and-pepper 
noise”, i.e., two gray levels (“black” and “white”) occurring
with equal probabilities. This means that their histograms
consist of two spikes of equal height located at
the  ends  of  the  gray  scale,  so  that  ordered  matching 
should  yield  no  advantage  at  all;  this  is  confirmed  by  the 
graphs  in  Figure  9.  In  this  case,  whether  we  use  ordered 
or  random  matching,  to  exceed  t  =  40%  we  must  match 
40%  of  the  pixels.  The  improvement  is  dramatic  when 
we  replace  the  image  and  template  by  local  property 
arrays;  in  this  case  we  have  used  the  local  property 

$$f'(x,y) = \max\bigl(|f(x+1,y) - f(x,y)|,\ |f(x-1,y) - f(x,y)|,\ |f(x,y+1) - f(x,y)|,\ |f(x,y-1) - f(x,y)|\bigr).$$

Evidently, $f'(x,y)$ will have value 0 with probability
about $\frac{1}{16}$, and value $> 0$ with probability about $\frac{15}{16}$. Thus
its  histogram  has  a  major  peak  at  the  high  end  of  the 
gray  scale  (and  a  much  smaller  one  at  the  low  end),  so 
that ordered matching should be very advantageous.
This is borne out by the graphs in Figure 10, where we
see that using ordered matching allows us to exceed
t = 40% when only a little over 5% of the pixels have
been  matched — an  improvement  by  a  factor  of  nearly  8. 
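The probability of a zero property value in salt-and-pepper noise can be estimated by simulation. The sketch below assumes independent pixels taking the two gray levels 0 and 255 with equal probability and evaluates the property only at interior pixels; the image size and random seed are arbitrary choices made here. Under these assumptions all four neighbors equal the center with probability $(1/2)^4 = 1/16$.

```python
import random


def max_abs_neighbor_difference(image):
    """Largest absolute difference between each interior pixel and its 4-neighbors."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            center = image[y][x]
            row.append(max(abs(image[y][x + 1] - center),
                           abs(image[y][x - 1] - center),
                           abs(image[y + 1][x] - center),
                           abs(image[y - 1][x] - center)))
        result.append(row)
    return result


def zero_fraction(size=200, seed=0):
    """Fraction of interior pixels whose property value is 0 in binary noise."""
    rng = random.Random(seed)
    image = [[rng.choice((0, 255)) for _ in range(size)] for _ in range(size)]
    values = [v for row in max_abs_neighbor_difference(image) for v in row]
    return values.count(0) / len(values)
```

Calling `zero_fraction()` should give a value close to 1/16, so nearly all of the derived values lie at the high end of the scale.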

The  results  of  this  section  show  that  knowledge 
about  second  order  gray  level  probability  densities,  or 
about the (first-order) probability densities of local property
values, can yield further reductions in the expected
computational  cost  of  matching.  It  will  be  recalled  that 
these  types  of  probabilistic  information  are  commonly 
used to characterize image textures [2]. Thus this section
illustrates  how  we  can  reduce  the  expected  cost  of  image 
matching  if  we  know  the  textures  of  the  images  that  are 
being  matched. 

In  the  example  given  in  this  section,  the  local  pro¬ 
perty  used  was  simply  a  directional  difference  of  gray  lev¬ 
els.  Other  local  properties  could  be  used,  as  appropriate, 
in  other  cases.  It  is  of  interest  to  recall  [3]  that  matching 
of  image  differences  (or  derivatives),  rather  than  the 
images  themselves,  may  yield  an  optimal  signal-to-noise 
ratio  (between  matches  and  mismatches)  when  the  images 
are  “busy”.  Our  results  show  that  matching  of 
differences  may  have  additional  advantages  in  reducing 
the  expected  computational  cost  of  the  matching  process. 

6.  CONCLUDING  REMARKS 

The  basic  idea  of  using  ordered  matching  to  reduce 
expected  cost  was  proposed  by  Nagel  and  Rosenfeld  over 
15  years  ago  [4].  This  paper  has  developed  a  general 
theoretical  framework  for  this  approach,  and  has  also 
shown  how  to  obtain  further  improvements  by  matching 
local  property  arrays.  We  have  seen  that  substantial  cost 
savings  (sometimes  of  50%  or  more)  can  be  obtained  in 
this  way.  As  shown  in  [4],  the  savings  can  more  than 
compensate  for  the  initial  cost  of  ordering  the  template 
pixels. 

Other methods of reducing the cost of image matching
have been proposed, for example, methods that make
use  of  multiple-resolution  representations  of  the  images. 
We  plan  to  investigate  the  extension  of  our  approach  to 
the  multiresolution  case  in  a  subsequent  paper. 
Abstract

Decision  trees  have  been  used  for  many  years  in  pattern  recog¬ 
nition  systems.  However,  their  use  in  visual  recognition  has 
been  restricted  to  the  pattern  recognition  domain,  where  fea¬ 
tures  are  required  to  be  measurements  which  form  the  axes 
of  a  vector  space,  and  which  are  often  assumed  independent. 
Computer  vision  researchers  have  studied  the  use  of  more 
powerful  recognition  techniques  that  consider  object  topol¬ 
ogy  and  viewpoint  consistency,  among  other  things.  How  to 
structure  a  large  database  for  efficient  recognition  when  us¬ 
ing  these  techniques  has  not  been  well  studied.  Here  I  claim 
that  the  decision  tree  data  structure  can  be  employed  as  suc¬ 
cessfully  outside  of  the  pattern  recognition  domain  as  it  has 
been within. To substantiate my claim I describe a system for
recognition  from  a  large  database  that  recognizes  polyhedra 
in  crowded  scenes  and  uses  object  topology,  comparisons  of 
length  and  angle,  and  viewpoint  consistency  as  tests. 

Given  the  a  priori  probability  of  polyhedra  to  appear  in  the 
image,  viewpoints  from  which  they  are  seen,  and  the  errors 
which  occur  in  detecting  their  edges,  the  decision  tree  is  au¬ 
tomatically  constructed  from  the  database  using  the  criterion 
of minimizing the expected recognition time. Since the optimal
solution of this problem is intractable, a greedy method
is  used  based  on  minimizing  the  expected  entropy  of  the  a 
posteriori  distribution  of  possible  matches.  The  motivation 
for  the  entropy  measure  is  derived. 

Since  object  topology  is  a  major  source  of  information  for 
the  recognition  system,  the  space  of  viewpoints  is  divided 
according  to  principal  view  regions.  The  large  number  of 
regions  that  this  results  in  does  not  reduce  the  run-time  effi¬ 
ciency  of  the  system,  since  any  number  of  regions  can  be  con¬ 
sidered  at  one  time  by  the  recognition  system,  whose  state 
is  determined  by  a  position  in  the  decision  tree.  Other  mea¬ 
surements  are  also  used,  for  example  angle  measurements  and 
comparisons  of  length,  where  they  will  provide  more  informa¬ 
tion  than  the  topological  data.  The  viewpoint  consistency 
constraint is used to provide an exact viewpoint determination
as soon as is economical from an information-theoretic
point of view. By considering matches directly to the model
at  run-time  instead  of  principal  views  the  redundancy  of  the 
principal  view  representation  is  avoided  and  the  viewpoint 
consistency  constraint  can  be  efficiently  applied. 


1  Introduction 

The  human  visual  system  can  recognize  objects  from  an  enor¬ 
mous  database  in  real  time.  Many  vision  systems  operate 
under  real-time  constraints,  and  a  vision  system  operating 
in  a  fairly  unconstrained  environment  may  have  to  recog¬ 
nize  objects  from  a  database  that  is  an  appreciable  fraction 
of  the  size  of  the  human  database.  Large  databases  have 
been  considered  by  Pattern  Recognition  researchers,  who 
have  achieved  success  at  indexing  into  them  using  decision 
trees  [Wang  and  Suen  1984].  Under  the  Pattern  Recognition 
paradigm  the  features  form  the  axes  of  a  vector  space  and 
an  object  is  represented  by  a  point  in  this  feature  space  [Fu 
1968].  The  features  are  often  assumed  independent. 

The  recognition  techniques  considered  here  are  object 
topology,  local  quantitative  measurements  and  the  viewpoint 
consistency  constraint.  They  do  not  fit  naturally  into  the 
Pattern Recognition paradigm, but this does not prevent a
vision system that uses them from structuring its database in the
form of a decision tree.

Given  a  probability  distribution  on  the  frequency  of  ob¬ 
jects,  the  viewpoints  from  which  they  are  seen,  and  the  errors 
which  occur  in  the  low-level  system,  a  decision  tree  can  be 
built  that  optimizes  the  expected  cost  of  recognition.  Thus, 
a  decision  tree  data  structure  differentiates  among  more  im¬ 
portant  and  less  important  features  for  distinguishing  among 
objects.  This  difference  in  importance  of  features  cannot  be 
determined  a  priori  but  is  instead  a  function  of  the  database. 
Turney  et  al.  [1985]  refer  to  this  concept  as  saliency. 

The  aim  of  this  work  is  to  show  that  a  decision  tree  should 
be  considered  as  a  data  structure  to  store  a  database  for  fast 
visual  recognition  outside  of  the  field  of  pattern  recognition. 
Here  a  recognition  system  is  described,  called  the  Tree  sys¬ 
tem,  that  uses  a  decision  tree.  For  simplicity,  the  objects  are 
restricted  to  be  polyhedra,  but  it  is  expected  that  the  over¬ 
all  design  of  the  system  will  extend  to  less  restricted  worlds. 
This  is  because  the  tests  used  do  not  have  to  be  restricted 
to  the  ones  used  in  the  Tree  system,  but  tests  for  closure, 
termination,  inside-outside,  color  and  texture  could  be  used 
as  well,  for  example. 

Because  the  Tree  system  uses  object  topology  as  a  major 
route to recognition, the concept of principal views [Koenderink
and van Doorn 1976] [Freeman and Chakravarty 1980]
is used in the construction of the decision tree. Each principal
view  defines  a  family  of  possible  projections,  all  members  of 
which  are  topologically  identical  and  differ  only  by  a  contin¬ 
uous  transformation.  The  Tree  system  uses  principal  views 
in  the  preprocessing  stage  to  calculate  the  possible  topolog¬ 
ical  appearance  of  the  objects  in  the  database.  The  deci¬ 
sion  tree  itself,  however,  is  almost  completely  independent 
of  them,  and  instead  deals  directly  in  matches  to  the  object 
model.  This  was  done  to  avoid  the  problems  associated  with 
the  naive  application  of  principal  views,  which  are 

1.  The  correspondence  among  the  same  edges  in  different 
principal  views  is  not  represented. 

2.  There  are  a  large  number  of  principal  views.  To  solve 
for  the  viewpoint  it  is  often  not  necessary  to  discover 
the  exact  principal  view  that  occurs  in  the  image. 

Because  the  decision  tree  deals  directly  in  matches  to  the 
model,  states  in  the  decision  tree  correspond  to  matches  to 
object  edges,  and  so  problem  1)  does  not  exist.  Problem  2) 
is  eliminated  by  allowing  the  possibility  of  solving  for  the 
viewpoint  before  the  principal  view  in  the  image  has  been 
uniquely  determined.  Instead  it  is  done  when  its  utility  rises 
above  the  utility  of  testing  for  more  topological  constraints 
or  for  other  viewpoint  invariant  features.  In  the  Tree  system 
principal  views  provide  only  the  lowest  grain  to  which  the 
possible  viewpoints  are  partitioned;  at  any  point  in  the  tree 
the  possible  range  of  viewpoints  consists  of  a  collection  of 
principal  views  of  one  or  more  models. 

2  Previous  Work 

The  theory  of  the  use  of  decision  trees  in  sequential  pattern 
recognition is covered in [Fu 1968]. An application to a large
database  of  two-dimensional  objects  can  be  seen  in  [Wang 
and  Suen  1984]. 

Burns  and  Kitchen  [1987]  have  designed  and  implemented 
a  system  with  a  similar  approach  to  the  one  described  here. 
Rather  than  constructing  a  decision  tree  as  the  Tree  system 
does,  they  construct  what  they  call  a  prediction  hierarchy. 
They  have  a  set  of  rules  for  building  the  tree  whose  aim  is  to 
minimize  the  size  of  the  hierarchy,  rather  than  to  minimize 
run-time.  At  run-time  the  hierarchy  is  searched  from  the  bot¬ 
tom  up  by  combining  features  found  in  the  image  whenever 
the  hierarchy  allows  it.  They  do  not  consider  the  run-time 
complexity,  nor  does  the  design  explicitly  aim  to  achieve  the 
minimum  complexity,  but  since  the  search  can  benefit  from 
parallel  processing  it  could  be  competitive. 

Cooper  and  Hollbach  [1987]  have  suggested  using  a  series 
of  filters  arranged  in  a  hierarchy,  using  local  computationally 
inexpensive  tests  in  parallel  over  the  image  and  principal  view 
database  at  the  beginning  and  more  computationally  expen¬ 
sive  tests  at  later  stages  when  the  number  of  possible  models 
has  been  substantially  reduced.  The  optimal  decision  tree  ap¬ 
proach,  by  contrast,  can  eliminate  models  without  explicitly 
testing  against  each  of  them,  and  is  constructed  to  use  the 
features that are most salient, as dictated by the database.
The decision tree approach cannot benefit from massive parallelism,
but does the best possible without it and does not
suffer  from  its  costs. 

Robert  Bolles  has  designed  systems  for  fast  recognition  for 
robotics  that  use  preprocessing  to  design  an  optimal  strategy 
at  runtime  [Bolles  and  Cain  1982]  [Bolles  and  Horaud  1986]. 
His  approaches  only  match  against  one  model  at  a  time,  and 
therefore  rely  on  the  uniqueness  of  the  features  being  detected 
to  achieve  fast  runtime.  Chris  Goad  [1987]  has  also  investi¬ 
gated  using  preprocessing  to  minimize  recognition  time.  His 
approach  can  only  consider  one  possible  assignment  of  image 
features  to  model  features  at  a  time,  and  so  will  have  com¬ 
plexity  linear  in  the  size  of  the  database.  The  decision  tree 
has  the  potential  of  logarithmic  time  complexity. 

3  Decision  Tree  Construction 

The  standard  way  of  building  an  optimal  decision  tree  is  to 
build  it  back-to-front  using  a  dynamic  programming  approach 
[Fu 1968]. The tree is built back-to-front because the optimal
decision at a particular node depends on knowing all
future optimal decisions. Building an optimal decision tree
is intractable for the domain the Tree system is designed for
because  the  space  of  possible  states  is  large. 

Instead  of  constructing  an  optimal  decision  tree,  one  can 
use  a  greedy  method  to  construct  the  tree  in  a  forward  fash¬ 
ion.  An  estimate  is  made  of  the  expected  cost  of  recognition 
in  the  subtrees  that  would  result  from  all  the  possible  tests 
at  a  given  leaf  of  the  partial  decision  tree.  Based  on  this  es¬ 
timate,  the  test  is  chosen.  In  the  appendix  it  is  shown  that 
in  many  cases  the  estimate  takes  the  form  of  an  information 
entropy function (see [Shannon 1948] or a text in information
theory [Jones 1979]). The examples in Section 5 assume an
entropic cost function. To improve the estimate, lookahead
can be used, in a similar fashion to game tree search [Slagle
and Lee 1971]. The ‘opponent’ in this case is nature, or
chance.

4  Dealing  With  Errors 

It  is  possible  to  account  for  errors  in  the  input  to  the  recogni¬ 
tion  system  either  in  the  construction  of  the  decision  tree  or 
at  runtime.  If  errors  are  considered  in  the  construction  of  the 
decision  tree,  the  nodes  in  the  decision  tree  represent  states 
which  include  the  possibility  that  an  error  has  been  made. 
The  same  object  becomes  a  possible  match  on  more  than  one 
branch  of  the  tree,  a  condition  termed  overlap  in  the  Pattern 
Recognition  literature  [Wang  and  Suen  1984]. 

If  the  runtime  approach  is  used,  as  is  used  in  the  Tree  sys¬ 
tem,  the  states  do  not  include  the  possibility  of  error.  Instead 
multiple  processes  explore  the  decision  tree  concurrently.  If 
doubt  in  a  decision  rises  above  a  threshold,  processes  simul¬ 
taneously  investigate  the  branches  of  the  decision  tree  that 
descend  from  each  plausible  outcome.  The  number  of  pro¬ 
cesses  is  kept  to  within  the  number  of  processors  available 
by ranking them according to confidence. The confidence associated
with the initial thread of control, $C_1$, is initially set to
1. Whenever decision $i$ is made, the confidence of the thread
of control is multiplied by the probability, $P_i$, that the decision
made was the correct one. So the confidence of the $t$-th
thread of control after its $k$-th decision is

$$C_t(k) = \prod_{i=1}^{k} P_i$$

With  C  defined  as  described,  it  corresponds  to  the  prob¬ 
ability  that  the  thread  of  control  is  the  correct  one,  under 
the  assumption  that  the  probabilities  of  each  decision  being 
correct  are  independent. 
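The confidence bookkeeping can be sketched in a few lines. The function names, the dictionary representation of the threads, and the fixed threshold below are assumptions made for illustration only; they are not taken from the Tree system.

```python
import math


def thread_confidence(decision_probabilities):
    """Product of the probabilities that each decision so far was correct.

    An empty list gives 1, the initial confidence of a thread of control.
    """
    return math.prod(decision_probabilities)


def keep_most_confident(threads, max_processes, threshold):
    """Names of the threads to keep running, ranked by decreasing confidence.

    threads maps a thread name to the list of its decision probabilities.
    Threads below the confidence threshold are dropped, and at most
    max_processes threads survive.
    """
    scored = {name: thread_confidence(p) for name, p in threads.items()}
    alive = [name for name, score in scored.items() if score >= threshold]
    alive.sort(key=lambda name: scored[name], reverse=True)
    return alive[:max_processes]
```

For example, with threads `{"a": [0.9, 0.8], "b": [0.5], "c": [0.99]}`, two processors and a threshold of 0.6, thread `b` is dropped and the threads are ranked `c`, then `a`.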

Processes  are  terminated  either  when  they  fall  below  a  con¬ 
fidence  threshold  or  when  they  arrive  at  a  leaf  node  in  the 
decision  tree.  The  confidence  threshold  can  be  made  dynamic 
at  the  cost  of  some  communication  among  the  processors.  A 
dynamic  threshold  is  desired  because  not  all  objects  are  nec¬ 
essarily  of  the  same  probability,  and  low-probability  errors 
may  occur,  pushing  the  confidence  of  all  the  processes  down¬ 
ward.  A  list  of  the  most  promising  terminated  processes  can 
be maintained to be reactivated if the confidence of running
processes drops.

5  Examples 

This  section  describes  two  examples  that  demonstrate  various 
aspects  of  the  recognition  system. 

The  first  example  builds  a  complete  decision  tree  for  some 
very simple objects. An interesting feature of this example
is that a greedy technique would not guarantee an optimal
tree, despite its simplicity.

The  second  example  builds  part  of  a  decision  tree  for  one 
object  that  is  significantly  more  complex  than  those  in  the 
first  example. 

5.1  Example  1 

The  objects  are  a  triangle,  a  square,  a  pentagon  and  a  hep¬ 
tagon.  They  are  shown  in  Figure  1.  All  edges  are  the  same 
length in each model. The models are all assumed equiprobable,
so each may appear with probability 0.25. The range
of  possible  viewpoints  is  such  that  (scaled)  orthographic  pro¬ 
jection  is  a  good  approximation,  and  so  parallel  lines  appear 
parallel  and  lines  of  equal  length  in  the  models  appear  equal 
length  in  the  image. 

Figure  2  shows  all  possible  search  trees.  Since  the  objects 
are  so  simple  there  is  only  one  point  at  which  a  choice  can  be 
made.  The  choice  is  whether  to  (1)  expand  the  fourth  edge  or 
(2)  test  the  first  and  third  line  segments  for  parallelism.  As 
is  evident,  the  choice  to  expand  the  fourth  edge  is  the  better 
one.  Let  us  calculate  the  expected  entropy  of  both  choices. 

Figure 1: Model Base (Example 1)

Figure 2: Two Possible Search Trees (Example 1); e = expand edge, p = test for parallel

Figure 3: The House Model

We are at state i, which is reached with probability 0.75.
For the choice of expanding the fourth edge, we find two
possible outcomes. In the case when the object is a square
the cycle of edges closes (state ii). For both the pentagon
and the heptagon the cycle is not closed (state iii).
The probabilities of each state, given that we have reached
state i, are

$$P(ii \mid i) = \frac{0.25}{0.75} = 0.333$$

$$P(iii \mid i) = \frac{0.5}{0.75} = 0.667$$

The entropy of state ii is zero, since the match is known. The
probabilities of each match once state iii has been reached are
P(pentagon) = 0.5 and P(heptagon) = 0.5, and so

$$H(iii) = -\sum_j P_j \log P_j = -2\,(0.5 \log 0.5) = 0.693$$

The expected entropy is

$$E(H_1) = (0.333)(0) + (0.667)(0.693) = 0.462$$

A  similar  calculation  shows  that  the  expected  entropy  of 
choice  2  is  equal  to  the  expected  entropy  of  choice  1  and  so 
the  greedy  method  of  minimizing  entropy  does  not  guarantee 
choosing  the  optimal  tree  in  this  case.  Searching  ahead  one 
ply would enable the entropy method to choose the correct
tree, however.
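The calculation for Example 1 can be reproduced with a short script. It is a sketch only: natural logarithms are used, as in the figures above, and it is assumed here that the parallelism test separates the square from the pentagon and heptagon in the same way as expanding the fourth edge does, which is consistent with the two expected entropies being equal.

```python
import math


def entropy(probabilities):
    """Entropy in nats, with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def expected_entropy(outcomes):
    """Expected entropy after a test.

    outcomes is a list of groups; each group holds the prior probabilities
    of the matches that remain possible if that outcome is observed.
    """
    total = sum(sum(group) for group in outcomes)
    result = 0.0
    for group in outcomes:
        weight = sum(group) / total  # probability of this outcome
        result += weight * entropy([p / sum(group) for p in group])
    return result


# State i: the square, pentagon and heptagon remain, each with prior 0.25.
expand_fourth_edge = [[0.25], [0.25, 0.25]]  # closes (square) / stays open
test_parallel = [[0.25], [0.25, 0.25]]       # parallel (square) / not parallel
```

Both `expected_entropy(expand_fourth_edge)` and `expected_entropy(test_parallel)` evaluate to about 0.462, so the entropy criterion alone cannot separate the two choices without lookahead.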

5.2  Example  2 

Figure  3  shows  the  object  whose  pose  the  decision  tree  will 
be  designed  to  recognize  (the  ‘house’  figure  from  [Burns  and 
Kitchen  1987]).  Figure  4  shows  the  principal  views  chosen  a 
priori to have greater than zero probability of occurrence. I
will  ignore  testing  for  parallelism,  length  of  lines  or  solving  for 
the  viewpoint  and  only  consider  the  choice  of  edge  to  expand. 
This  will  be  done  using  the  greedy  entropy  measure.  Figure 
5  shows  possible  search  trees  starting  from  a  vertex  of  degree 
4  and  assuming  the  first  edge  expanded  finds  that  the  vertex 
of  degree  4  is  adjacent  to  another  vertex  of  degree  4. 

Figure 4: Some Principal Views of the House

Figure 5: Possible Search Trees (Example 2)

Because of the symmetries of the 4-4 pair of vertices there
are only two different choices of edge to expand, which I will
call end and side (left and right branches respectively in Figure
5). Tables 1 and 2 summarize the relevant information
for the end and side choices respectively. In these tables the
views columns give the number of times the graph or match
occurs in each principal view. The frequency $f$ can be calculated
from the information in the views column. For a graph
or match $G$ it is just

$$f(G) = \sum_{v \in V} n_v(G) P(v)$$

where $V$ is the set of principal views, $n_v(G)$ is the number of
occurrences of $G$ in principal view $v$, and $P(v)$ is the probability
of the principal view $v$. Note that $n_v(G)$ may be greater
than one for a match $G$ because of symmetry. The probabilities
are obtained from the frequencies by normalization.
The entropy is

$$H(G) = -\sum_{i \in S} p_i \log p_i$$

where $S$ is the sample space under consideration. In our case
the sample space is the set of possible matches. The expected
entropy is

$$E(H) = \sum_{G} P(G) H(G)$$

where  the  sum  is  over  all  possible  graphs  that  are  obtained 
by  expanding  the  edge  being  considered. 

The tables show that the expected entropy is lower if one
chooses to expand the end edge ($E(H) = 0.231$) rather than the side
edge ($E(H) = 0.722$). Therefore, using the minimum entropy
method without lookahead to construct the tree one would
choose to expand the end edge.
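The step from the views columns to an expected entropy can be sketched as follows. The view names, probabilities and occurrence counts in the example data are hypothetical, chosen only to exercise the formulas; they are not the entries of Tables 1 and 2. The data layout (a list of pairs, each holding the occurrence counts of a graph and of its possible matches) is likewise an assumption of this sketch.

```python
import math


def frequency(view_counts, view_probabilities):
    """f(G): sum over principal views v of n_v(G) * P(v)."""
    return sum(n * view_probabilities[v] for v, n in view_counts.items())


def normalize(frequencies):
    """Turn frequencies into probabilities that sum to one."""
    total = sum(frequencies)
    return [f / total for f in frequencies]


def entropy(probabilities):
    """Entropy in nats, with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def expected_entropy(graphs, view_probabilities):
    """E(H): sum over graphs G of P(G) * H(G).

    graphs is a list of (graph_view_counts, list_of_match_view_counts) pairs.
    """
    graph_probs = normalize([frequency(g, view_probabilities) for g, _ in graphs])
    total = 0.0
    for p_graph, (_, matches) in zip(graph_probs, graphs):
        match_probs = normalize([frequency(m, view_probabilities) for m in matches])
        total += p_graph * entropy(match_probs)
    return total


# Hypothetical data: two principal views and two graphs after an expansion.
views = {"v1": 0.5, "v2": 0.5}
graphs = [
    ({"v1": 1}, [{"v1": 1}]),             # one possible match, entropy 0
    ({"v2": 2}, [{"v2": 1}, {"v2": 1}]),  # two equally likely matches
]
```

For this made-up data the graph probabilities are 1/3 and 2/3, and the expected entropy is (2/3) log 2.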

Table 1: Expand end edge

6 Conclusion

The Tree recognition system described in this paper is being
implemented to test the growth in size of the decision tree
with the size of the database, its robustness to errors in input
and to inaccurate priors, and its runtime complexity. We are
also looking into a real-time incremental decision tree construction
algorithm, which would allow the system to collect
its priors from experience and adjust its recognition strategy
on-line. The off-line procedure described here requires statistics
to be collected and the decision tree constructed prior to
runtime.

Acknowledgements

I would like to thank my advisor Dana Ballard for his guidance
and suggestions, Paul Cooper and Nancy Watts for influential
discussions, and Chris Brown for his encouragement
and support.

This work was supported in part by U. S. Army Engineering
Topographic Laboratories research contract no.
DACA76-85-C-0001, in part by NSF Coordinated Experimental
Research grant no. DCR-8320136, in part by
ONR/DARPA research contract no. N00014-82-K-0193 and
in part by a Natural Sciences and Engineering Research
Council of Canada Postgraduate Scholarship.

A Entropy as a Measure of Expected Cost

Since the optimality criterion being used for decision tree construction
is minimum expected search depth, the best evaluation
function to use for constructing the tree is one that
estimates the expected search depth from that node. Here
we show that under many conditions that function takes the
form of an appropriately scaled entropy measure.

Theorem 1 Suppose every test has $m$ equiprobable outcomes
and that each of the $n$ possible events falls into exactly one
of the outcomes of each test. If every test has cost $c$ then the
expected cost $C$ of the search is

$$E(C) = -c \sum_{i=1}^{n} p_i \log_m p_i$$

with the understanding that

$$p_i \log_m p_i = 0$$

if $p_i = 0$ or 1.

Proof. The proof is by recursion.

$E(C) = 0$ at a leaf node. Since at a leaf node $p_j = 1$ for
some $j$ and $p_i = 0$ for all $i \ne j$ then

$$\sum_{i=1}^{n} p_i \log_m p_i = 0$$

by definition and the theorem holds.

For a node that is not a leaf assume the theorem holds for
all its $k$ subtrees. Then the expected cost of the search is

$$E(C) = c + \frac{1}{m} \sum_{j=1}^{k} \Bigl(-c \sum_{i=1}^{n_j} p'_i \log_m p'_i\Bigr)$$

where $n_j$ is the number of events in subtree $j$ and $p'_i$ is the
probability of event $i$ within its subtree. Now since

$$n = \sum_{j=1}^{k} n_j$$

and

$$p_i = \frac{p'_i}{m}$$

we have

$$E(C) = c - \frac{c}{m} \sum_{i=1}^{n} m p_i \log_m (m p_i)$$

$$= c \Bigl(1 - \sum_{i=1}^{n} p_i \log_m (m p_i)\Bigr)$$

$$= c \Bigl(1 - \sum_{i=1}^{n} p_i \log_m m - \sum_{i=1}^{n} p_i \log_m p_i\Bigr)$$

$$= -c \sum_{i=1}^{n} p_i \log_m p_i$$

and the theorem is proved. Q.E.D.
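Theorem 1 can be checked on small trees. In the sketch below a search tree is represented as a nested list whose leaves are event probabilities, so that an internal node is a test and its children are the outcomes; this representation and the function names are assumptions made for illustration. The direct expected cost agrees with the entropy expression whenever every test has $m$ equiprobable outcomes.

```python
import math


def leaves(tree):
    """Event probabilities at the leaves of a nested-list tree."""
    if isinstance(tree, list):
        return [p for child in tree for p in leaves(child)]
    return [tree]


def expected_cost(tree, c):
    """Expected search cost: c per test, weighted by outcome probability."""
    if not isinstance(tree, list):
        return 0.0  # a leaf: the event is identified, no further tests
    total = sum(leaves(tree))
    return c + sum(sum(leaves(child)) / total * expected_cost(child, c)
                   for child in tree)


def entropy_cost(probabilities, m, c):
    """-c * sum of p_i log_m p_i, with the term taken as 0 for p_i = 0 or 1."""
    return -c * sum(p * math.log(p, m) for p in probabilities if 0 < p < 1)


# Binary tests (m = 2): the first test has two outcomes of probability 1/2.
binary_tree = [0.5, [0.25, 0.25]]
# Ternary tests (m = 3): nine equally likely events.
ternary_tree = [[1 / 9] * 3 for _ in range(3)]
```

For `binary_tree` with `c = 1` both functions give 1.5, and for `ternary_tree` with `c = 2` both give 4, up to floating-point rounding.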

B  Unequal  Test  Outcomes 

Tests  normally  do  not  have  equally  likely  outcomes.  What 
can  we  say  about  tests  whose  outcomes  are  not  equally  likely? 
It  is  possible  to  analyze  this  problem  analytically  if  we  assume 

1.  the  distribution  of  probabilities  of  the  m  outcomes  of 
each  test  is  constant, 

2. the probabilities of each outcome are all powers of $\frac{1}{k}$, and

3. all $k^n$ events are equally likely.
Let the outcomes of the test be $o_i$, $i = 1, 2, \ldots, m$, and denote
their probabilities by $P(o_i)$. Without loss of generality
assume that $o_1$ is the most likely outcome. Say its probability
is $\frac{1}{k}$ for some $k$. Then the expected depth of the search tree is

$$E(k^n) = c + \sum_{i=1}^{m} P(o_i)\, E\bigl(k^n P(o_i)\bigr) = c + \sum_{i=1}^{m} P(o_i)\, E\bigl(k^{\,n + \log_k P(o_i)}\bigr)$$

If we rewrite this equation it can be put in a form in which
it is readily recognizable as a difference equation. Let

$$A_i = -\log_k P(o_i), \qquad a_n = E(k^n).$$

Then

$$a_n = c + \sum_{i=1}^{m} P(o_i)\, a_{n - A_i}$$

or

$$a_n - \sum_{i=1}^{m} P(o_i)\, a_{n - A_i} = c. \qquad (1)$$

Ignoring  initial  conditions  for  now,  we  can  try  to  find  a  so¬ 
lution  to  the  difference  equation  by  trying  a  general  form 
and  solving  for  the  undetermined  coefficients.  A  good  rule  of 
thumb  is  to  try  an  expression  that  looks  like  the  right  hand 
side  when  the  equation  is  written  in  the  form  of  Equation  1. 
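Trying a constant solution $b_0$ does not work because then the
left side sums to zero. If we try a linear solution $b_1 n$ we get
by substituting into Equation 1

\[
b_1 n - \sum_{i=1}^{m} P(o_i)\, b_1 (n - A_i)
  = b_1 \sum_{i=1}^{m} P(o_i) A_i = c,
\qquad\text{so}\qquad
b_1 = \frac{c}{-\sum_{i=1}^{m} P(o_i) \log_k P(o_i)}
\]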

and so we have found a particular solution to the difference
equation. The difference between any two solutions to a
nonhomogeneous equation with constant coefficients is a solution
to the homogeneous equation

\[
a_n - \sum_{i=1}^{m} P(o_i)\, a_{n - A_i} = 0.
\]

Since $a_n$ is just a weighted average of previous values, it
is easy to see that the solutions to the homogeneous equation
do not grow with $n$. Therefore any solution to the
nonhomogeneous equation is the particular solution we found above,
to within a term that does not grow with $n$.

Replacing $a_n$ by the original notation we have

\[
E(k^n) = b_1 n
\]

or, letting $n_t = k^n$ and substituting for $b_1$,

\[
E(n_t) = \frac{c \log_k n_t}{-\sum_{i=1}^{m} P(o_i) \log_k P(o_i)}.
\]

Here  again  an  entropy-type  factor  is  introduced. 
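
The linear growth claimed for Equation 1 can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes the initial condition $a_n = 0$ for $n \le 0$, and the function names and the example distribution (1/2, 1/4, 1/4 with k = 2) are illustrative choices.

```python
import math


def expected_depth_table(probs, k, c, n_max):
    """Tabulate a_n = c + sum_i P(o_i) * a_{n - A_i} for n = 0..n_max.

    Assumes a_n = 0 for n <= 0 (an illustrative initial condition) and
    that every A_i = -log_k P(o_i) is a positive integer.
    """
    shifts = [round(-math.log(p, k)) for p in probs]
    a = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        a[n] = c + sum(
            p * (a[n - s] if n - s > 0 else 0.0)
            for p, s in zip(probs, shifts)
        )
    return a


def predicted_slope(probs, k, c):
    """b_1 = c / (-sum_i P(o_i) * log_k P(o_i))."""
    return c / -sum(p * math.log(p, k) for p in probs)


probs = [0.5, 0.25, 0.25]          # all powers of 1/k with k = 2
a = expected_depth_table(probs, k=2, c=1.0, n_max=50)
growth_per_level = a[50] - a[49]   # increment of a_n for large n
slope = predicted_slope(probs, k=2, c=1.0)
```

For large n the increment `growth_per_level` settles at the predicted coefficient `slope`, which is the sense in which the homogeneous part "does not grow with n".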


C  Different  Tests 

What if different tests have different distributions of
outcomes? Suppose there are $T$ different tests, each of which
has $k_i$ equally likely outcomes, for $i = 1, 2, \ldots, T$. Suppose
each test is used with probability $P(t_i)$ and has cost $c$. Then
an idea of how to obtain a combined estimate of expected
cost can be obtained by imagining the successive application
of the tests in proportion to their relative probabilities.
Consider it as a combined test. The probability of each outcome
will be

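\[
P(o) = \prod_{i=1}^{T} \left( \frac{1}{k_i} \right)^{\beta P(t_i)}
\]

where

\[
\beta = \operatorname{lcm}\!\left( \frac{1}{P(t_i)} \right)
\]

assuming the $P(t_i)$'s are rational. This is just as if one test
with

\[
\prod_{i=1}^{T} k_i^{P(t_i)}
\]

outcomes was repeated $\beta$ times. So we see that the geometric
mean of the numbers of outcomes of each test is the effective
number of outcomes.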
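
The closing remark can be illustrated with a small computation. This sketch assumes that the geometric mean is weighted by each test's usage probability P(t_i); the function name and the example values are illustrative and do not come from the paper.

```python
import math


def effective_outcomes(outcome_counts, usage_probs):
    """Weighted geometric mean of the outcome counts k_i.

    The weights are the usage probabilities P(t_i), which must sum to 1.
    """
    if len(outcome_counts) != len(usage_probs):
        raise ValueError("one usage probability is needed per test")
    if not math.isclose(sum(usage_probs), 1.0):
        raise ValueError("usage probabilities must sum to 1")
    return math.prod(k ** p for k, p in zip(outcome_counts, usage_probs))


# Two tests with 2 and 8 outcomes, each used half of the time,
# behave like a single test with about 4 outcomes.
k_eff = effective_outcomes([2, 8], [0.5, 0.5])
```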
Abstract

A  model-based  vision  system  requires  models  to  predict 
object  appearances.  How  an  object  appears  in  the  image  is  a 
result  of  interaction  between  the  object  properties  and  the 
sensor  characteristics.  Thus,  in  model-based  vision,  we  ought 
to  model  the  sensor  as  well  as  modeling  the  object.  In  the 
past,  however,  the  sensor  model  was  not  used  in  model-based 
vision or, at least, was contained in the object model
implicitly.

This  paper  presents  a  framework  between  an  object  model 
and  the  object's  appearances.  We  consider  two  aspects  of 
sensor  characteristics:  sensor  detectability  and  sensor 
reliability. Sensor detectability specifies what kinds of features
can be detected and under what conditions the features are
detected;  sensor  reliability  is  a  confidence  for  the  detected 
features.  We  define  the  configuration  space  to  represent 
sensor  characteristics.  We  propose  a  representation  method 
for  sensor  detectability  and  reliability  in  the  configuration 
space.  Finally,  we  investigate  how  to  apply  the  sensor  model 
to  a  model-based  vision  system,  in  particular,  automatic 
generation  of  an  object  recognition  program  from  a  given 
model. 

1.  INTRODUCTION 

A  model-based  vision  system  requires  models  to  predict 
object appearances. Various researchers have proposed many kinds
of object models, ranging from generic models such as
generalized cylinders [1,2], extended Gaussian images [3,4], and
superquadric models [5], to specific models such as aspect
models [6,7,8], region-relation models [9], and smooth local
symmetry [10].

This  research  was  sponsored  by  the  Defense  Advanced 
Research  Projects  Agency,  DOD,  through  ARPA  Order  No. 
4976,  monitored  by  the  Air  Force  Avionics  Laboratory  under 
contract  F33615-84-K-1520,  and  under  The  Analytic  Sciences 
Corporation subcontract 87123 modification No. 1. The
views and conclusions contained in this document are those of
the authors and should not be interpreted as representing the
official policies, either expressed or implied, of the Defense
Advanced Research Projects Agency or of the U.S. Government.


The object appearances, however, are determined by a
product of an object model with a sensor model. As shown in
Figure 1, the same object model in the same attitude can
create different appearances and features when seen by
different sensors. Edge-based binocular stereo reliably detects
depth at edges perpendicular to the epipolar lines.
Photometric stereo or a light-stripe range finder detects
surface orientation and depth of surfaces which are illuminated
and visible both by the light source and by the cameras.


Figure 1: Object appearances (binocular stereo, photometric
stereo, range finder).

Thus, in model-based vision, it is insufficient to consider
only an object model; it is essential to exploit a sensor model
as well. On the other hand, modeling sensors for model-based
vision has attracted little attention; quite often, researchers
who are familiar with the sensors they use tend to construct
object appearances by implicitly incorporating their sensor
behavior. This paper, in contrast, explores a general
framework for explicitly incorporating sensor models which
govern the relationship between object models and object
appearances.

A sensor model must be able to specify two important
characteristics: sensor detectability and sensor reliability.
Sensor detectability specifies what kind of features can be
detected under what conditions. Sensor reliability is a
measure of confidence for the detected features. This paper
presents a method for modeling sensors with sensor
detectability and sensor reliability, and shows how to use them
in model-based vision. We define the configuration space to
represent sensor characteristics. Next, we consider two aspects
of sensor characteristics: sensor detectability and sensor
reliability. We propose a representation method for sensor
detectability, and examine how to use the representation for
generating object appearances. Sensor reliability analysis
consists of determining uncertainty in sensory measurement and
analyzing uncertainty propagation from sensory measurements to
geometric features. Finally, we investigate how to use
these sensor models for model-based vision, in particular, in
our on-going project, automatic generation of object
recognition programs.

2.  SENSORS  IN  MODEL-BASED  VISION 

Different  types  of  sensors  are  used  in  model-based  vision. 
For  our  purpose,  "sensors"  are  transducers  which  transform 
"object  features"  into  "image  features".  For  example,  an  edge 
detector  detects  edges  of  an  object  as  lines  in  an  image. 
Photometric  stereo  measures  surface  orientations  of  surface 
patches  of  an  object.  There  are  both  passive  and  active 
sensors.  Binocular  stereo  is  passive,  while  a  light-stripe  range 
finder  is  an  active  sensor  using  actively  controlled  lighting. 
Table  1  gives  a  summary  of  various  sensors  in  terms  of  what 
object  features  are  detected  in  what  forms. 


Table 1 Summary of Sensors

Sensor                              Vertex  Edge        Face    active/passive

Edge Detector [11,12,13]            -       line        -       passive
Shape-from-shading [14,15]          -       -           region  passive
Synthetic Aperture Radar [16]       point   point/line  point   active
Time-of-Flight Range Finder [17,18] -       -           region  active
Light-stripe Range Finder [30]      -       -           region  active
Binocular Stereo [19,20]            -       line        -       passive
Trinocular Stereo [21]              -       line        -       passive
Photometric Stereo [22,23]          -       -           region  active
Polarimetric Light Detector [24]    -       -           point   active

In  addition  to  qualitative  descriptions  of  a  sensor,  a  sensor 
model  must  describe  two  characteristics  quantitatively: 
detectability  and  reliability.  Detectability  specifies  what  kind 
of  features  can  be  detected  in  what  conditions.  Reliability 
specifies  the  expected  uncertainty  in  sensory  measurement 
and  geometric  features  derived  from  measurement.  Since 
these  two  characteristics  depend  on  how  the  sensor  is  located 
relative  to  an  object  feature,  we  will  first  define  a  feature 
configuration space to represent the spatial relationship
between the sensor and the feature. Then, we will investigate
how to specify detectability and reliability over the space.

3.  FEATURE  CONFIGURATION  SPACE 

Whether and how reliably a sensor detects an object feature
depends on various factors: distance to a feature, attitude of a
feature, reflectivity of a feature, transparency of air, ambient
lighting, and so forth. In most model-based vision problems,
the attitude of a feature, that is, angular freedom in the
relationship between a feature and a sensor, affects sensor
characteristics most. In order to specify this freedom
explicitly, we attach a coordinate system to an object feature and
consider the relationship between the sensor coordinate
system and the feature coordinate system. For example, for a
face feature, we define a coordinate system so that the z axis
of the feature coordinate system agrees with the surface
normal and the x-y axes lie on the face, but are defined arbitrarily
otherwise. For other features, we can define feature
coordinates appropriately. See Appendix 1 for more details.

Since angular relationships between the two coordinate
systems are relative, for the sake of convenience we fix the
sensor coordinate system and discuss how to specify feature
coordinates with respect to it. The angular relation from the
sensor coordinate system to a feature coordinate system can
be specified by three degrees of freedom: two degrees of
freedom in the direction of the z axis of a feature coordinate
system, and one degree of freedom in the rotation about the z
axis. See Figure 2(a).

Since we consider only the angular relationship, we can
translate the feature coordinates so that the two coordinate
systems share the origin. Then, we define a sphere
whose center is located at the origin of the sensor coordinate
system, and whose north pole is on the z axis of the sensor
coordinate. We specify a feature coordinate as a point in
the sphere. Referring to Figure 2(b), the north pole of the
sphere is made to correspond to the case when the feature
coordinates are aligned completely with the sensor
coordinates. For other feature coordinates, the direction from the
sphere center to the point coincides with the z axis of the
feature. The distance from the spherical surface to the point is
determined by the angle of rotation (modulo 360°) around the
z axis from the coordinate on the spherical surface. A point
on the spherical surface represents a feature coordinate
obtained by rotating the sensor coordinate around the axis
perpendicular to the plane given by the sphere center, the
spherical point, and the north pole. We will refer to this sphere
as the feature configuration space (see footnote 1). Figure 2(c)
shows various feature coordinates corresponding to spherical
points, while Figure 2(d) shows those corresponding to points on
a radial axis.


Figure 2: Feature configuration space:
(a) A feature coordinate with respect to
the sensor coordinate; (b) A feature
configuration space; (c) Feature coordinates
on the spherical surface; (d) Feature
coordinates along a radial axis.


4. DETECTION CONSTRAINTS AND ASPECTS

In the previous section, we defined a way to
represent the relationship between sensor coordinates and
feature coordinates. In this section, we will develop a constraint
to determine whether a feature can be detected at each point
of the configuration space.

4.1.  Detection  Constraints 

Each sensor has two components: illuminators and
detectors. For example, both a time-of-flight range finder and a
light-stripe range finder have one light source and one TV
camera. Binocular stereo has one light source (without light
sources you cannot observe anything) and two TV cameras;
photometric stereo has three light sources and one TV camera.
Table 2 summarizes the number of illuminators and detectors
of each sensor.

Footnote 1: This representation will not create discontinuities
around the north pole, as opposed to the case in which Euler
angles from the sensor coordinate frame to the feature
coordinate frame are used to specify spherical points; this
representation will instead create discontinuities at the center
of the sphere and at the south pole. However, this is
advantageous because we mostly use the area around the north pole
to discuss detectability and reliability.


Table 2 Illuminator and Detector

Sensor                        Number of illuminators  Number of detectors

Edge Detector                 1                       1
Shape-from-shading            1                       1
SAR                           1                       1
Time-of-Flight Range Finder   1                       1
Light-stripe Range Finder     1                       1
Binocular Stereo              1                       2
Trinocular Stereo             1                       3
Photometric Stereo            3                       1
Polarimetric light detector   n                       1

One  illuminator  only  illuminates  one  part  of  an  object;  one 
detector  only  observes  one  part  of  the  object.  Each  sensor, 
which  consists  of  illuminators  and  detectors,  only  detects  one 
part  of  the  object.  In  order  for  a  feature  to  be  detected  by  a 
sensor,  the  feature  must  satisfy  certain  conditions  on  being 
illuminated by its illuminators and being visible from its
detectors. 

Once we define a local coordinate system on a feature, we
can compute the configurations of a feature in which it is
illuminated by each illuminator, and the configurations in which
it is visible from each detector. In the following discussion, we
will consider both illuminators and detectors as generalized
sources (G-sources). Each G-source has two properties: the
illumination direction and the illuminated configurations. For
an illuminator, these two terms have their ordinary
meanings. In the case of detectors, the illumination
direction corresponds to the line of sight of the detector, and
the illuminated configurations correspond to the configurations
visible from the detector.

A G-source illumination direction can be represented in the
feature configuration space as a radial line from the sphere
center. G-source illuminated configurations can be specified
as a volume in the configuration space. Finally, we can obtain
the constraints under which the feature is detectable by the
sensor with AND operations on the illumination directions and
illuminated configurations of all component G-sources of the
sensor.

Figure 3(b) shows an example analysis of a face feature for
the light-stripe range finder in Figure 3(a). A light-stripe range
finder has two G-sources (a TV camera and a light source):
the direction denoted by V1 indicates the line of sight of the
TV camera; V2 indicates the illumination direction of the light
source. The illuminated configurations of a face from the
light source are determined by the z axis (i.e., its surface
normal) of a face feature coordinate, and are not dependent on
its rotation. Therefore, illuminated configurations of a feature
form a spherical cone whose axis is V2 and whose apex angle
is d1. Similarly, the configurations of a feature visible from
the TV camera form a spherical cone whose center direction
is V1 and whose apex angle is d1. Since a light-stripe range
finder detects the faces which are illuminated from the source
and visible from the TV camera, the detectable configurations
(necessary condition) are the intersection of the two cones.
Thus, the resulting detection constraints in Figure 3(b) contain
three constraint components: two illumination directions and
one volume of detectable configurations.
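
The AND of two cone constraints can be sketched in a few lines. This is a hedged illustration only: it treats the apex angle as the half-angle between the cone axis and the cone surface, ignores occlusion and the rotation about the surface normal, and the function names and example directions are hypothetical rather than taken from the paper's geometric modeler.

```python
import math


def angle_between_deg(u, v):
    """Angle in degrees between two non-zero 3-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))


def within_cone(normal, axis, half_angle_deg):
    """True if the surface normal lies inside the cone around `axis`."""
    return angle_between_deg(normal, axis) <= half_angle_deg


def detectable_by_light_stripe(normal, v1, v2, half_angle_deg):
    """AND of the camera cone (axis v1) and the light-source cone (axis v2)."""
    return (within_cone(normal, v1, half_angle_deg)
            and within_cone(normal, v2, half_angle_deg))


V1 = (0.0, 0.0, 1.0)                        # assumed camera line of sight
V2 = (0.5, 0.0, math.sqrt(3.0) / 2.0)       # assumed light direction, 30 deg off V1

facing = detectable_by_light_stripe((0.0, 0.0, 1.0), V1, V2, 80.0)
# Inside the camera cone (about 77 deg from V1) but outside the light cone.
grazing = detectable_by_light_stripe((-0.9, 0.0, 0.2), V1, V2, 80.0)
```

Here `facing` is true because the normal is inside both cones, while `grazing` is false because the face is visible to the camera but not illuminated.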

Figure 3: Detection constraints of a
light-stripe range finder: (a) Our
hypothetical light-stripe range finder; (b)
Constraints in the feature configuration
space. Note that there are two kinds of
constraints: illuminated configurations
and illumination directions.

Similarly, we can analyze the detection constraints for the
various sensors in Table 1. The results of the analysis are
summarized in Figure 4. The right column represents each
sensor's detection constraints in the feature configuration
space. The center column represents our formal language to
specify the detection constraints to our geometric modeler to
be used for generating object appearances. The definition of
the language is found in Appendix II.

4.2. Use of Detection Constraints

To predict object appearances, we apply the constraints to
each feature of the object. Each feature is detectable by the
sensor if it satisfies the following two conditions:

1. None of the illumination directions are occluded
by any other parts of the object.

2. The detectable configurations contain the
configuration of the feature.


Sensor                 Constraints                        Constraints
                       in the formal definition           in the sensor space

Edge Detector          (AND (NS edge V d)                 (diagram)
                            (NS edge V d))
                       = (NS edge V d)

Shape-from-shading     (AND (NS face V d)                 (diagram)
                            (NS face V d))
                       = (NS face V d)

SAR                    (OR (NS face V d)                  (diagram)
                           (NS edge V d)
                           (NS vertex V d))
                       (needs postprocess)

Time-of-Flight Range   (AND (NS face V d)                 (diagram)
Finder                      (NS face V d))
                       = (NS face V d)

Light-stripe Range     (AND (NS face V1 d)                (diagram)
Finder                      (NS face V2 d))

Binocular Stereo       (AND (NS edge V1 d1)               (diagram)
                            (DS edge V2 d2 VE de)
                            (DS edge V3 d3 VE de))

Trinocular Stereo      (AND (NS edge V1 d1)               (diagram)
                            (DS edge V2 d2 VE de)
                            (DS edge V3 d2 VE de)
                            (DS edge V4 d2 VE de))

Photometric Stereo     (AND (NS face V d1)                (diagram)
                            (NS face V1 d2)
                            (NS face V2 d2)
                            (NS face V3 d2))

Polarimetric Light     (OR (AND (NS face V d)             (diagram)
Detector                        (NS face V1 d))
                           (AND (NS face V d)
                                (NS face V2 d))
                       )
                       where V · Vi = cos 2d

Figure 4: Detection constraints of various sensors.


To check these conditions we use the constraints with a
geometric modeler. We rotate the object into a certain
attitude to be examined, and then see whether its features
satisfy the previous constraints. Figure 5 illustrates this
process of predicting object appearances for a light-stripe
range finder. Suppose an object is placed as in Figure 5(a).
Figure 5(b) shows the detection constraints on a face for a
light-stripe range finder. We put this configuration space
on each candidate face to examine whether the face is
detectable. See Figure 5(c). This amounts to checking the
following conditions:

1. The light source direction is not occluded by
other faces.

2. The line of sight of the TV camera is not
occluded by other faces.

3. The local coordinate of a face, defined by the
surface normal (z axis) and the tangential plane
(x-y axes), is contained in the detectable
configurations.

Figure  5(d)  shows  the  result  of  this  operation.  The  shaded 
areas  indicate  those  which  satisfy  the  conditions  and  thus  are 
detectable  by  the  light-stripe  range  finder. 

4.3.  Aspect 

Using the object appearances predicted by the detection
constraints, we can generate aspects defined by a particular
sensor. Aspects are originally defined as topologically
equivalent classes with respect to the object features [6]. Since
each sensor has particular features to be detected, we have to
modify the definition of aspects according to the features
detected by a sensor. We will consider aspects given from
appearances by a light-stripe range finder.

We can define aspects for a light-stripe range finder by
those faces detectable in each aspect. Suppose we have $n$
faces, $S_1, S_2, \ldots, S_n$. Let the variable $X_i$ denote
whether or not the face $S_i$ is detected, that is

\[
X_i = \begin{cases}
  1 & \text{face } S_i \text{ is detectable;} \\
  0 & \text{otherwise.}
\end{cases}
\]

An $n$-tuple $(X_1, X_2, \ldots, X_n)$ represents a label of an
object appearance in terms of face detectability. This label will
be referred to as a shape label, and we can characterize each
object attitude with this label. The set of contiguous object
attitudes that have the same shape label forms an aspect in
this example.
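
As a small illustration of shape labels, the sketch below builds labels from sets of detectable faces and collapses runs of identical labels into aspects. It assumes the attitudes are sampled along a single rotation path, so that "contiguous" means consecutive samples; the function names and the sample data are hypothetical.

```python
from itertools import groupby


def shape_label(detectable_faces, n_faces):
    """Return (X_1, ..., X_n) with X_i = 1 if face S_i is detectable."""
    return tuple(1 if i in detectable_faces else 0
                 for i in range(1, n_faces + 1))


def aspects_along_path(labels):
    """Collapse consecutive attitudes with the same shape label.

    Returns a list of (label, number_of_sampled_attitudes) pairs.
    """
    return [(label, len(list(run))) for label, run in groupby(labels)]


# Detectable faces at five sampled attitudes of a three-face object.
samples = [{1}, {1}, {1, 2}, {1, 2}, {2}]
labels = [shape_label(faces, 3) for faces in samples]
aspects = aspects_along_path(labels)
```

With these samples the path crosses three aspects, labeled (1, 0, 0), (1, 1, 0) and (0, 1, 0).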

5.  DETECTABILITY  DISTRIBUTION  AND 
ASPECT  TRANSIT 

5.1.  Detectability  Distribution 

The feature detection constraint gives the upper bound of
the detectable configurations in the space. In some cases,
however, even though an object feature satisfies the detection
constraints, it may be undetected due to noise. We define the
detectability distribution as the probability that a feature in
the detectable configurations is actually detected. The
probability is usually high in the central part and low in the
peripheral part of the detectable configurations.


Figure 5: How to use the constraints:
(a) A light-stripe range finder; (b) The
detection constraints; (c) Applying the
detection constraints to object features;
(d) Detectable faces determined.


Since all sensors in Table 1 detect features based on a
brightness value corresponding to a feature, the detectability
distribution depends on a brightness value which is detected
and converted to sensor features. However, there are two
types of sensors in terms of the conversion method: intensity
based sensors and position based sensors. The intensity based
sensor measures the brightness value and converts it to sensor
features directly. The position based sensor measures the
brightness value and gets positional information of the bright
spot if the brightness value is greater than some threshold.
The position based sensor then converts the positional
information to sensor features. Table 3 shows a classification of
sensors based on this difference.

Since the detectability distribution depends on sensing
methods, we will develop the detectability distribution for a
light-stripe range finder as a representative case of the
position based sensor, and the distribution for photometric stereo
as a representative case of the intensity based sensor.

Table 3 Measurement method

Sensor                        Sensing method

Edge Detector                 intensity
Shape-from-shading            intensity
SAR                           intensity
Time-of-Flight Range Finder   position
Light-stripe Range Finder     position
Binocular Stereo              position
Trinocular Stereo             position
Photometric Stereo            intensity
Polarimetric light detector   intensity

5.1.1.  Detectability  distribution  of  light-stripe  range  finder 

We will consider the detectability distribution of a
hypothetical light-stripe range finder. A light-stripe range
finder projects a plane of light onto the scene and determines
the position of a surface patch from the slit image. The
detectability depends on whether the slit image is bright
enough to be detected, say brighter than a
threshold $I_0$. Assuming a Lambertian surface, the brightness
of the slit image is given by $I_s\, N \cdot S$, where $N$ is the
surface normal, $S$ is the light source direction, and $I_s$ is
the light source brightness. If we assume additive zero-mean
Gaussian noise in brightness with power $\sigma^2$, the resultant
brightness distribution of a slit will be

\[
P(I) = \frac{1}{\sqrt{2\pi}\,\sigma}
       \exp\!\left( -\frac{(I - I_s\, N \cdot S)^2}{2\sigma^2} \right).
\]

Thus, the probability distribution of feature detectability of
our hypothetical range finder can be described as

\[
P_d = \mathrm{Prob}(I \ge I_0) = \int_{I_0}^{+\infty} P(I)\, dI.
\]

As shown in Figure 6, this probability decreases as the
incident angle of the light stripe increases, and near the
boundary of the illuminated configurations of the light source, the
probability approaches 0.

Figure 6: Detectability distribution for a
light-stripe range finder
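
The detection probability of a thresholded Gaussian brightness has a closed form in terms of the complementary error function. The sketch below is an illustration under stated assumptions: N and S are unit vectors so that N·S is the cosine of the incident angle, and the brightness, threshold and noise values are made up for the example.

```python
import math


def detection_probability(incident_angle_deg, source_brightness,
                          threshold, sigma):
    """Prob(I >= threshold) for I ~ Normal(mean, sigma^2).

    The mean is source_brightness * cos(incident angle), i.e. the
    Lambertian slit brightness for unit vectors N and S.
    """
    mean = source_brightness * math.cos(math.radians(incident_angle_deg))
    return 0.5 * math.erfc((threshold - mean) / (sigma * math.sqrt(2.0)))


# Illustrative values: source brightness 100, threshold 50, noise sigma 10.
p_frontal = detection_probability(0.0, 100.0, 50.0, 10.0)
p_at_threshold = detection_probability(60.0, 100.0, 50.0, 10.0)
p_grazing = detection_probability(85.0, 100.0, 50.0, 10.0)
```

The probability is close to 1 for frontal illumination, about one half where the expected brightness equals the threshold, and close to 0 at grazing incidence.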

5.1.2.  Detectability  distribution  of  photometric  stereo 

An intensity based sensor can be modeled as

\[
y = G(x)
\]

where $x$ is the input brightness, $y$ is the output feature value,
and $G$ is the conversion function. Suppose $X^*$ is the
definition area of the function $G$; i.e., the intensity based sensor
outputs a feature value $y$ from any $x_i$ if $x_i \in X^*$. Then, the
detectability distribution can be determined as the probability
that the input brightness $x + \delta x$, disturbed by $\delta x$, is
still contained in the definition area $X^*$.

Photometric stereo determines the surface orientation from
three images taken from the same position under different
lighting directions:

\[
I_1 = S_1 \cdot N, \qquad
I_2 = S_2 \cdot N, \qquad
I_3 = S_3 \cdot N,
\]

where $I_i$, $S_i$, $N$ are the brightness value under light source
$i$, the $i$-th light source direction vector, and the surface
normal vector, respectively. Thus, expressing the brightness as a
vector, $I$, and the light sources as a matrix, $S$,

\[
I = S N.
\]

Applying $S^{-1}$ to both sides, we obtain an explicit expression
of $N$,

\[
N = S^{-1} I = A I,
\]

where $A = S^{-1}$. This is the basic idea of photometric stereo [22].
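
The inversion N = S^{-1} I can be sketched for a single pixel without any external library by solving the 3x3 system with Cramer's rule. This is only an illustration of the ideal, noise-free relation (it does not include the normalization, thresholding or lookup-table steps of a working system), and the light directions and surface normal below are arbitrary example values.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve_normal(S, I):
    """Solve S N = I for N, where the rows of S are light directions."""
    d = det3(S)
    if abs(d) < 1e-12:
        raise ValueError("light directions must be linearly independent")
    normal = []
    for col in range(3):
        # Cramer's rule: replace column `col` of S by the brightness vector.
        replaced = [[I[r] if c == col else S[r][c] for c in range(3)]
                    for r in range(3)]
        normal.append(det3(replaced) / d)
    return normal


def unit(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


S = [unit(v) for v in [(1, 0, 1), (0, 1, 1), (-1, -1, 1)]]   # example lights
true_normal = [0.6, 0.0, 0.8]                                # example unit normal
I = [sum(s * n for s, n in zip(row, true_normal)) for row in S]
recovered = solve_normal(S, I)
```

Because the brightness triple was generated from `true_normal`, `recovered` reproduces it up to rounding error.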

The definition area of $I$ is determined as the inverse image of
the valid surface orientations. Namely, since $N$ is a unit
vector, a valid surface orientation $N$ satisfies $N^T N = 1$. Once we
insert the relationship between surface orientation and
brightness triple, we obtain $(A I)^T (A I) = 1$, which gives the
definition area of $I$.

A working photometric stereo system has, however, three
modifications [23] to the original theory.

1. Brightness values are normalized by 1/|I| so that
we can cancel the albedo effect.

2. Threshold operations are applied to brightness
values so that we can eliminate shadow regions.

3. A is determined from calibration and stored in a
lookup table rather than calculated from the
ideal case.

We will obtain the detectability distribution of
photometric stereo according to these modifications. Assume
a brightness value moves from $i_1$ to $i_1 + \delta i_1$ due to the
TV camera's error. The normalization gives

\[
i'_1 + \delta i'_1 = (i_1 + \delta i_1) / (i_1 + \delta i_1 + i_2 + i_3).
\]

However, the normalized intensity
$(i'_1 + \delta i'_1,\ i'_2,\ i'_3)$ exists on the same plane
$i'_1 + \delta i'_1 + i'_2 + i'_3 = 1$. Since a continuous area on
the plane is the definition area for photometric stereo and the
new triple $(i'_1 + \delta i'_1,\ i'_2,\ i'_3)$ is still contained in
the definition area on the plane, we can obtain the solution from
the new triple.

That is, we will always succeed in obtaining the feature
values, i.e., we will have a unit detectability distribution for
light source 1. (Though of course the resultant value may be
less reliable, as will be discussed in the reliability section.)
The same discussion is applicable to light source 2 and light
source 3. This analysis reveals that normalization makes the
detectability a unit value, and thus helps to detect features in
a stable manner.

We use the threshold operations to detect shadows. This
operation also affects the detectability distribution. Namely,
only in the case that all three brightness values are greater
than a threshold value can we determine the surface orientation.
This effect can be modeled in the same way as the light-stripe
range finder. Thus,

\[
P_d = \prod_{i=1}^{3} \int_{I_0}^{+\infty}
      \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left( -\frac{(I - I_{s,i}\, N \cdot S_i)^2}{2\sigma^2} \right) dI
\]

gives the detectability distribution over the detectable
configurations, where $I_{s,i}$ and $S_i$ are the light source
brightness and the light source direction of the $i$-th
illuminator, respectively.

5.2.  Transition  of  Aspect 

The continuous change of detectability makes the aspect
transition continuous and causes the aspect boundaries to become
blurred. In order to characterize an aspect boundary, we can
define the distance between two aspects across the boundary
by the Hamming distance between their corresponding shape
labels $(x_1, x_2, \ldots, x_n)$, where $x_i = 1$ if face $i$ is
detected and $x_i = 0$ otherwise. Thus, the distance between two
aspects is the number of faces which switch between detected and
nondetected states across the aspect boundary.

Consider an aspect boundary between aspects A and B
whose Hamming distance is one, that is, aspects A and B differ
in the detection of only one face $F_i$. Near the aspect boundary,
aspect A may then be observed incorrectly as aspect B with a
probability that depends on the detectability of face $i$. A
similar false observation will also occur for aspect B.


If the Hamming distance of aspects A and B is more than
one across a boundary, then erroneous intermediate aspects,
which are neither A nor B, can occur near the boundary. This
can be easily seen by considering an example where aspect A
has $(x_1, x_2) = (1, 0)$ and aspect B has $(x_1, x_2) = (0, 1)$ as
shape labels, respectively. Then, we will observe object
appearances belonging to four aspects near the boundary:
aspects (1, 1) and (0, 0) in addition to aspects A and B. For
example, aspect (1, 1) may be observed instead of aspect A, with
a probability that depends on the detectabilities of the two
faces. This consideration must be taken into account when
grouping and classifying aspects by an interpretation tree.
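
The distance between aspects and the appearance of intermediate labels can be illustrated as follows. The Hamming distance is as described in the text; the probability model, in which each face is detected independently with its own detectability, is an assumption made for this sketch and is not stated in the paper. Function names and the numeric detectabilities are hypothetical.

```python
from itertools import product


def hamming_distance(label_a, label_b):
    """Number of faces whose detected state differs between two labels."""
    if len(label_a) != len(label_b):
        raise ValueError("shape labels must have the same length")
    return sum(a != b for a, b in zip(label_a, label_b))


def label_probabilities(detectabilities):
    """Probability of every shape label near a boundary.

    Assumes (for illustration) that face i is detected independently
    with probability detectabilities[i].
    """
    result = {}
    for label in product((0, 1), repeat=len(detectabilities)):
        p = 1.0
        for x, d in zip(label, detectabilities):
            p *= d if x else 1.0 - d
        result[label] = p
    return result


aspect_a, aspect_b = (1, 0), (0, 1)
distance = hamming_distance(aspect_a, aspect_b)
# Example detectabilities near the boundary: face 1 is 0.9, face 2 is 0.2.
probs = label_probabilities([0.9, 0.2])
```

With a distance of two, all four labels receive non-zero probability under this model, including the intermediate labels (1, 1) and (0, 0).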


6.  RELIABILITY  OF  SENSORS 

Once a sensor feature is detected, the next question is
how reliable the sensor feature is. This section discusses two
issues of uncertainty in feature values. The first issue is
uncertainty in sensory measurement: any measurement
detected by a sensor contains measurement uncertainty.
Determining the bound of this uncertainty is important
for model-based vision. For example, suppose there is
a sensor feature for which the geometric model takes two
nominal values, 100 and 90, for two distinct situations. If a
sensor has an uncertainty range of plus/minus 1 for the sensor
feature, the two ranges do not overlap, and we can use the
feature from that sensor as one of the reliable discriminators
in the recognition stage. On the other hand, if a sensor has an
uncertainty range of plus/minus 20, the two ranges overlap, and
we cannot use the feature from that sensor.
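
The example above can be written as a one-line test. The sketch below assumes that a feature discriminates two situations when the intervals "nominal value plus/minus uncertainty" do not overlap; the paper does not state the exact criterion, so this rule and the function name are assumptions.

```python
def is_reliable_discriminator(nominal_a, nominal_b, uncertainty):
    """True if [a - u, a + u] and [b - u, b + u] do not overlap."""
    return abs(nominal_a - nominal_b) > 2 * uncertainty


narrow = is_reliable_discriminator(100, 90, 1)   # ranges 99..101 and 89..91
wide = is_reliable_discriminator(100, 90, 20)    # ranges 80..120 and 70..110
print(narrow, wide)
```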

The second issue is the propagation of uncertainty from sensor
features to geometric features, and hence the resulting uncertainty
of those geometric features. In some cases, a sensor feature
detected by a sensor is used directly as a feature; in most
cases, however, geometric features are derived from sensor
features. Thus, it is necessary to model the mechanism by which
uncertainty propagates in order to determine the resulting
uncertainty in geometric features.

6.1.  Uncertainty  in  Sensory  Measurement 

Uncertainty in sensor measurement arises from various
sources: variance in brightness values, variance in light source
direction, and various digitization mechanisms. However, the
major uncertainty of an intensity-based sensor comes from
brightness variance, and the major uncertainty of a
position-based sensor comes from variance in light source
direction, as shown in Table 4.

Table 4  Main factor of unreliability

Sensor                        Type  Factor
Edge Detector                 Int   brightness variance
Shape-from-shading            Int   brightness variance
SAR                           Int   flight direction
Time-of-Flight Range Finder   Pos   mirror direction
Light-stripe Range Finder     Pos   mirror direction
Binocular Stereo              Pos   camera directions
Trinocular Stereo             Pos   camera directions
Photometric Stereo            Int   brightness variance
Polarimetric light detector   Int   polarimetric variance

(Int: intensity-based sensor; Pos: position-based sensor.)

Since uncertainty in sensory measurement depends on the
sensor, we will analyze the light-stripe range finder as a
representative position-based sensor and photometric stereo as
a representative intensity-based sensor.

6.1.1.  Light-stripe  range  finder 

We  will  consider  a  depth  measurement  by  a  hypothetical 
light-stripe  range  finder.  Let  us  assume  that  the  main  source 
of  uncertainty  in  depth  measurement  comes  from  the  am¬ 
biguity  of  the  slit  position  on  a  surface  due  to  the  width  of  the 
light  beam  and  angular  errors  in  setting  the  light  directions. 
The  uncertainty  model  can  be  obtained  analytically. 

As shown in Figure 7(a), let us denote the angular uncertainty
of the light stripe by $\delta\theta$. The light is intercepted by an
object surface, creating a slit pattern on it. The angular
uncertainty $\delta\theta$ of the light direction results in uncertainty
$\delta y$ in the position on the surface:

$$\delta y = \frac{r\,\delta\theta}{\cos\alpha},$$

where $r$ is the distance of the surface from the light source,
and $\alpha$ is the angle between the light direction S and the
surface normal N. This positional uncertainty on the surface
is observed as the slit position uncertainty (or "slit width") $\delta i$
in the camera image. If $\beta$ is the angle between the surface
normal N and the viewer direction V, then

$$\delta i = (\cos\beta)\,\delta y.$$

Finally, this uncertainty is transferred into the uncertainty in
depth measurement by triangulation. For simplicity, if we
assume orthographic projection for the camera, the uncertainty
in the image $\delta i$ creates uncertainty in depth $\delta z$,

$$\delta z = \frac{\delta i}{\tan\gamma},$$

where $\gamma$ is the angle between V and S.

In total, by representing the angles $\alpha$, $\beta$, and $\gamma$ in terms of
V, N, and S, we obtain

$$\delta z = \frac{\cos\beta}{\cos\alpha\,\tan\gamma}\, r\,\delta\theta
           = \frac{(N \cdot V)(V \cdot S)}{(N \cdot S)\sqrt{1 - (V \cdot S)^2}}\, r\,\delta\theta.$$

Since $r$ is roughly constant, the uncertainty distribution of this
light-stripe range finder over the detectable configurations is
governed by the factor
$\frac{(N \cdot V)(V \cdot S)}{(N \cdot S)\sqrt{1 - (V \cdot S)^2}}$.
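
The depth uncertainty formula can be evaluated numerically. The sketch below is only an illustration under stated assumptions: V, N, and S are 3D vectors, with S and V pointing away from the surface toward the light source and the viewer, so that both dot products with N are positive for a detectable configuration. The function names are hypothetical.

```python
import math


def _unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / norm for c in v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def depth_uncertainty(V, N, S, r, dtheta):
    """delta z = cos(beta) / (cos(alpha) * tan(gamma)) * r * dtheta."""
    V, N, S = _unit(V), _unit(N), _unit(S)
    cos_alpha = _dot(N, S)   # light direction vs. surface normal
    cos_beta = _dot(N, V)    # viewer direction vs. surface normal
    cos_gamma = _dot(V, S)   # viewer direction vs. light direction
    sin_gamma = math.sqrt(max(0.0, 1.0 - cos_gamma ** 2))
    if cos_alpha <= 0.0 or cos_beta <= 0.0:
        raise ValueError("surface is not both illuminated and visible")
    if sin_gamma == 0.0:
        raise ValueError("V and S are parallel; triangulation is undefined")
    factor = (cos_beta * cos_gamma) / (cos_alpha * sin_gamma)
    return factor * r * dtheta


# Surface facing the camera, light source 45 degrees off the viewing axis.
V = (0.0, 0.0, 1.0)
N = (0.0, 0.0, 1.0)
S = (math.sin(math.radians(45)), 0.0, math.cos(math.radians(45)))
dz = depth_uncertainty(V, N, S, r=1000.0, dtheta=0.001)
print(dz)
```

In this configuration the factor equals the square root of two, so the depth uncertainty is about 1.41 in the units of r.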
Figure 7(b) plots this function. The left sphere represents
the feature configuration space. The detectable configurations
are enclosed by two great circles. Since the light-stripe range
finder is independent of rotation around the z axis of the
feature coordinate system, we only plot uncertainty for the
configurations on the spherical surface. The detectable
configurations are projected onto the tangential plane at the north
pole. The right diagram represents the distribution of
uncertainty over the plane. The larger the angle between the surface
normal and the illuminator direction, the larger the uncertainty.

Figure 7: Predicted uncertainty in depth
measurement by a light-stripe range
finder: (a) Detection mechanism; (b)
Predicted uncertainty in depth measurement.

6.1.2.  Photometric  stereo 

Let us consider the uncertainty in surface orientation obtained by
photometric stereo. Our photometric stereo can be described
as a two-step process. First, a brightness triple I is converted
to a normalized brightness triple E:

$$E = F(I).$$

Then, the normalized brightness triple is converted to a
surface orientation N:

$$N = A(E).$$

Thus, we can obtain the uncertainty of surface orientation

$$dN = \frac{\partial A}{\partial E}\, dE = \frac{\partial A}{\partial E}\, J\, dI,$$

where J is the Jacobian matrix of F.

We now examine the Jacobian matrix over the detectable
configurations. Figure 8(a) shows the distribution of
$\partial f_1 / \partial I_1$ over the detectable configurations, where $f_1$ is
the first component of F and the detectable configurations
are plotted on the tangential plane at the north pole of the
feature configuration space. Although it is possible to
approximate the distribution with a polynomial, we assume
for simplicity that it is constant at 0.004 over the detectable
configurations. We follow the same method for the other
components of the Jacobian matrix:

$$J = \begin{pmatrix}
0.004 & 0.002 & 0.002 \\
0.002 & 0.004 & 0.002 \\
0.002 & 0.002 & 0.004
\end{pmatrix}.$$


We determine $\frac{\partial A}{\partial E}\, dE$ from real data, because A is
represented as a lookup table. In our experiments, our TV
camera (8 bit) has a standard deviation of 3. Thus, taking two
sigma, $dI = 6$. Suppose only one light source causes error at
one time. Then the normalized brightness uncertainty is
$\sqrt{dE \cdot dE} = 0.03$. Our lookup table has 16 cells along each
normalized brightness axis, and the distance between the maximum
normalized brightness and the minimum normalized brightness
is roughly 0.8 over the detectable configurations. The 0.03
uncertainty in normalized brightness corresponds to 1.5 mesh
uncertainty in the lookup table.
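
The normalized brightness uncertainty of 0.03 can be checked directly from the constant Jacobian matrix. The sketch below assumes that dE = J dI and that only the first light source carries the two-sigma error of 6 grey levels; the function name is hypothetical.

```python
import math

# Constant approximation of the Jacobian matrix of F given in the text.
J = [
    [0.004, 0.002, 0.002],
    [0.002, 0.004, 0.002],
    [0.002, 0.002, 0.004],
]


def normalized_brightness_uncertainty(jacobian, dI):
    """Length of dE = J * dI."""
    dE = [sum(row[k] * dI[k] for k in range(len(dI))) for row in jacobian]
    return math.sqrt(sum(e * e for e in dE))


dI = [6.0, 0.0, 0.0]   # two-sigma error on one light source only
norm_dE = normalized_brightness_uncertainty(J, dI)
print(round(norm_dE, 4))
```

Because J is symmetric with equal diagonal entries, the result is the same whichever of the three light sources carries the error.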

Next, we examine the relationship between one cell distance
and the surface orientation difference. Figure 8(b) shows
the angular distance between two adjacent cells in the lookup
table. By using this result and a 1.5 mesh uncertainty in the
normalized brightness, the total uncertainty in surface
orientation becomes 5 degrees over the detectable configurations.
In Figure 8(c), the horizontal broken lines indicate plus/minus
5 degrees uncertainty, while the vertical thick lines indicate
uncertainty directly measured from our system. Both results
agree over the detectable configurations.

6.2.  Uncertainty  in  Geometric  Features 

Usually  sensory  measurements,  such  as  depth  detected  by  a 
range  finder,  are  further  converted  into  object  features  such  as 
area  and  inertia  of  a  face.  This  process  involves  grouping 
pixels  into  regions,  extracting  some  feature  values  and  trans¬ 
forming  them  into  object  features.  Modeling  uncertainty 
propagation  in  this  process  is  difficult  in  general,  but  as  an 
example  of  predicting  the  uncertainty  range  of  a  geometric 
feature,  let  us  consider  an  area  feature  of  a  face  detected  as  a 
region  by  our  hypothetical  light-stripe  range  finder.  Figure 
9(a)  shows  the  conversion  process  from  depth  values  to  the 
area  of  a  face.  The  process  includes  three  parts:  grouping 
pixels  into  a  region  to  obtain  the  area  in  the  image,  computing 
the  surface  orientation  of  the  region,  and  finally  converting 
the  image  area  into  the  surface  area  by  the  affine  transform 
determined by the surface orientation. We will analyze how
uncertainty is introduced and propagated in these three
parts.

Suppose that a surface under consideration has the real area
A and the surface orientation $\beta$ (angle between the surface
normal and the viewing direction). It should create a region
of size n pixels, where

$$n = A\cos\beta.$$

However, because of the imperfect detectability of the sensor,
the sensor fails to find some of these pixels, and the measured
area will be different from the nominal area n. Let $P_d$ denote the
detectability for this surface which we have computed in
subsection 4.2. Then, the process of measuring the area by
sampling n pixels can be modeled by a binomial distribution
with mean $nP_d$ and variance $nP_d(1-P_d)$. Assuming two
standard deviations, the discrepancy in area measurement will be

$$\delta n = n - \left(nP_d - 2\sqrt{nP_d(1-P_d)}\right)
           = n(1-P_d) + 2\sqrt{nP_d(1-P_d)}.$$

Figure 8: Uncertainty in surface orientation
by photometric stereo: (a) Distribution of
$\partial f_1 / \partial I_1$ over the detectable configurations;
(b) Angular difference between
two adjacent cells in the lookup table; (c)
Uncertainty in surface orientation over the
detectable configurations.

Another uncertainty is introduced in obtaining the surface
orientation $\beta$ from measured depths due to uncertainty in
depth measurement $\delta z$. If we estimate the surface orientation
at a pixel by differentiating depths of neighboring pixels, then
the uncertainty in surface orientation will be $\cos^2\beta\,\delta z$.
However, since we have roughly n pixels in the region, the
surface orientation will be averaged, reducing the uncertainty
by a factor $\sqrt{n}$. Thus

$$\delta\beta = \frac{\cos^2\beta\,\delta z}{\sqrt{n}}.$$

Finally, the estimate of the area of the face, $A + \delta A$, is
obtained by converting the image area into 3D surface area:

$$A + \delta A = \frac{n + \delta n}{\cos(\beta + \delta\beta)}.$$

Thus, assuming that $\delta\beta$ is small, we see that

$$\delta A = A(1-P_d) + 2\sqrt{\frac{A P_d (1-P_d)}{\cos\beta}} + A\tan\beta\,\delta\beta.$$

In this way, we can predict what deviations from the nominal
value of the area feature should be expected once we model
the sensor and know its intrinsic detectability $P_d$ and
uncertainty in depth $\delta z$.

Figure 9: Conversion process from
depth values to the area of a face.

In order to examine the validity of the propagation model,
we carried out an experiment using our photometric stereo. We
observed a face five times with our photometric stereo, where
the face was inclined 30° from the direction of the TV camera's
axis. From the measured surface orientation and the projected
pixel count, we recovered the face area and face inertia by
the affine transform and obtained the results shown in the
Observed column of Table 5.

Our photometric stereo has plus/minus 5° (or 0.0872
radian) uncertainty in surface orientation and observed
n = 506 pixels from the face. This gives $\delta\beta = 0.00387$. Since
we measure the face under ideal conditions, $P_d = 1$. We insert
these parameters into the above equation and obtain the
uncertainty in area. In order to predict the uncertainty range in
inertia, we double the uncertainty range in area. These values
are shown in the Predicted column of Table 5.

The results show that the predicted uncertainty is consistent
with the measured uncertainty.
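
The numbers in this experiment can be reproduced from the propagation model. The sketch below divides the area formula by A and uses n = A cos(beta), which gives a relative uncertainty; reading the Table 5 entries as relative uncertainties is an interpretation, not something the text states, and the function names are hypothetical.

```python
import math


def pixel_discrepancy(n, p_d):
    """Two-sigma shortfall in pixel count under a binomial model."""
    return n * (1.0 - p_d) + 2.0 * math.sqrt(n * p_d * (1.0 - p_d))


def averaged_orientation_uncertainty(per_pixel_uncertainty, n):
    """Averaging over n pixels reduces the uncertainty by sqrt(n)."""
    return per_pixel_uncertainty / math.sqrt(n)


def relative_area_uncertainty(n, p_d, beta, dbeta):
    """delta A / A = delta n / n + tan(beta) * delta beta."""
    return pixel_discrepancy(n, p_d) / n + math.tan(beta) * dbeta


n = 506                 # observed pixels
per_pixel = 0.0872      # 5 degrees, in radians
dbeta = averaged_orientation_uncertainty(per_pixel, n)
rel_area = relative_area_uncertainty(n, 1.0, math.radians(30.0), dbeta)
rel_inertia = 2.0 * rel_area   # the text doubles the area range
print(round(dbeta, 5), round(rel_area, 4), round(rel_inertia, 4))
```

With these inputs the orientation uncertainty comes out near 0.00387 and the relative area uncertainty near 0.0022, in line with the values quoted in the text.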


Table 5  Reliability of Geometric Features

Feature   Observed   Predicted
Area      0.0023     0.0022
Inertia   0.0044     0.0043

6.3.  Use  of  Uncertainty  for  Tree  Generation 

We have to organize the aspect structure so that it is
convenient for applying the uncertainty formula derived in the
previous section. We represent aspects and appearances in
a frame system. An example of an aspect is shown in
Figure 10. Each appearance frame, such as I0, points to several
frames IMAGE-COMP01, IMAGE-COMP02 corresponding to its
visible faces. In the same way, ASPECT1 points to several
aspect component frames, ASPECT-COMP10,
ASPECT-COMP11, .... It also points to its instance images I0, I1,
..., while its aspect component frame ASPECT-COMP10
points to its instance 2D faces IMAGE-COMP01,
IMAGE-COMP12, ....


    {{ ASPECT1
       (IS-A ASPECT)
       (IS-AN-IMAGE-OF-ASPECT-OF.INV
          I0 I1 ...)
       (IS-A-ASPECT-COMP-OF.INV
          ASPECT-COMP10
          ASPECT-COMP11) }}

    {{ ASPECT-COMP10
       (IS-A ASPECT-COMP)
       (IS-A-ASPECT-COMP-OF ASPECT1)
       (IS-A-IMAGE-COMP-OF-ASPECT-OF.INV
          IMAGE-COMP01 IMAGE-COMP12)
       (IS-A-FACE-OF FACE4)
       (THIS-ASPECT-HAS-A-RELATION
          ASPECT-COMP-RELATION-10-11) }}

    {{ ASPECT-COMP-RELATION-11-10
       (IS-A ASPECT-COMP-RELATION)
       (P-ISLAND ASPECT-COMP11)
       (N-ISLAND ASPECT-COMP10) }}


Figure 10: Frame representation of
aspects: Each aspect structure consists of
an aspect frame, aspect component
frames, and aspect component relation
frames. An aspect frame also points to its
instance images I0, I1, ..., while its aspect
component frame, ASPECT-COMP10,
points to its instance 2D faces
IMAGE-COMP01, IMAGE-COMP12.


In order to predict the uncertainty of various feature values
at each aspect, we follow the pointers between aspect
component frames and appearance component frames. At each
appearance component frame, since a nominal value of a
feature and its configuration with respect to sensor
coordinates are given, we can calculate the uncertainty of the
feature value by using the formula described in the previous
section. Then, the uncertainty of the feature value at an
aspect component is obtained as a sum of the uncertainties of
the feature values over its registered image components. The
predicted range will be stored in a slot of the aspect
component frame. These uncertainty ranges of features will be
retrieved by generation rules at compile time to generate an
interpretation tree and by the execution process at run time in
recognizing a scene.
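
The bookkeeping described above can be sketched with two small record types. This is only an illustration: the class names, the slot name, and the numeric values are hypothetical, and the original system is a frame system rather than Python classes. The frame names follow Figure 10.

```python
from dataclasses import dataclass, field


@dataclass
class ImageComponent:
    """Appearance component frame: one visible face in one image."""
    name: str
    nominal_area: float       # nominal value from the geometric model
    area_uncertainty: float   # predicted for this sensor configuration


@dataclass
class AspectComponent:
    """Aspect component frame pointing to its registered image components."""
    name: str
    image_components: list = field(default_factory=list)
    slots: dict = field(default_factory=dict)

    def store_predicted_uncertainty(self):
        # Sum the uncertainties over the registered image components
        # and keep the result in a slot for later retrieval.
        total = sum(c.area_uncertainty for c in self.image_components)
        self.slots["area-uncertainty"] = total
        return total


comp = AspectComponent(
    "ASPECT-COMP10",
    [
        ImageComponent("IMAGE-COMP01", 1.00, 0.002),
        ImageComponent("IMAGE-COMP12", 0.95, 0.003),
    ],
)
total = comp.store_predicted_uncertainty()
print(total)
```

Storing the result in a slot mirrors the text: the value is computed once and can then be read both when the interpretation tree is generated and when a scene is recognized.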

7.  AUTOMATIC  GENERATION  OF  OBJECT
RECOGNITION  PROGRAM

Figure 11 presents an example of how to apply the sensor
model to the automatic generation of an object recognition
program. A geometric model as shown in Figure 11(b) is
obtained from a toy wagon in Figure 11(a). From the
geometric model and sensor detectability, we can generate
various possible appearances as shown in Figure 11(c). We
can classify these appearances into possible aspects, where
the appearances in each aspect share the same detectable
faces. One aspect structure is constructed for each aspect group
(Figure 11(d)). Predicted ranges of uncertainty of geometric
features are determined using the sensor reliability. Generation
rules generate the recognition strategy automatically
based on the predicted ranges of uncertainty (Figure 11(e)).
Once the recognition strategy is obtained, the strategy is
converted into an executable program (Figure 11(f)).

The generated program is applied to the scene shown in
Figure 12(a). The highest region, indicated by an arrow, is given
to the program. Figure 12(b) shows the needle map
given by our photometric stereo. The black nodes in Figure
12(c) indicate the flow of control in the real run. The
program classifies the region into the corresponding aspect, as
shown in Figure 12(c).

8.  CONCLUDING  REMARKS 

This  paper  discussed  modeling  sensors  and  how  to  apply 
sensor  modeling  to  automatic  generation  of  object  recognition 
programs.  Our  sensor  model  consists  of  two  characteristics: 
sensor detectability and sensor reliability. Sensor
detectability specifies under what conditions a sensor can detect a
feature, while sensor reliability is a measure of confidence for
the detected features over the detectable configurations.

We  have  defined  the  configuration  space  which  represents 
the  relationship  between  sensor  coordinates  and  object  coor¬ 
dinates.  The  sensor  detectability  and  the  sensor  reliability  are 
expressed  in  this  configuration  space.  Constraints  in  the 
configuration  space  involved  in  detecting  features  have  been 
developed  by  using  G-source  illuminated  configurations  and 
G-source  illumination  directions.  We  have  shown  how  to 
compute  the  sensor  detectability  distribution  and  the  sensor 
reliability  distribution  for  two  representative  sensors:  a  light- 
stripe  range  finder  and  photometric  stereo. 

In model-based vision, expected values of various features
can be computed from 3D geometric models. Those values
are, however, nominal values that would be obtained in ideal
cases or sensed by ideal sensors. On the other
hand, actually observed sensor data contain noise and should
be used accordingly. The sensor model bridges the
discrepancy between these two values by modeling the
distribution of the sensed value based on the characteristics of a
given sensor.

We have also analyzed the uncertainty propagation
mechanism from sensory measurements to geometric features.
This is important, because quite often we are interested in
geometric features derived from detected measurements. Once
we establish the method to model the uncertainty propagation,
we can also assess the uncertainty in the geometric features,
and hence we can construct a recognition system more
systematically and reliably. Further study is required in this area.

Figure 11: Sensor model and automatic
generation of object recognition program.


Calculating the detectable features of an object under the
constraints of various sensors is a tedious job when we use a
conventional geometric modeler. A better way is to interface a
geometric modeler with the sensor model proposed. We call
this a sensor modeler. The traditional geometric modeler only
allows users to generate a 3D object by combining primitive
objects and to display its views. In this sense, the traditional
modeling system has only one sensor model, which is
projection. The sensor modeler we propose can generate various 2D
representations under given sensor specifications. Part of this
facility is being implemented in our new geometric/sensor
modeler VANTAGE [29].

Figure 12: Tree execution: (a) Scene;
(b) surface orientation distribution of the
scene; (c) Execution result. Black nodes
indicate the trace of the control. The
target region is classified into the
corresponding aspect.

I.  Feature  coordinate  system

Face: We define the z axis of the feature coordinate system to
agree with the surface normal; the x-y axes lie on the face,
but are otherwise defined arbitrarily.

Edge: We define the z axis to agree with the average direction of
the two normals of the two adjacent faces incident to the edge.
We define the x axis of the feature coordinate system to agree
with the edge direction. The y axis is determined accordingly.

Vertex: We define the z axis to agree with the average direction of
the normals of the adjacent faces incident to the vertex. The
x-y axes lie on the plane perpendicular to the z axis, but are
otherwise defined arbitrarily.

II.  G-source  definition 

We will define two kinds of G-sources in terms of the
distribution of illuminated configurations: the uniform
G-source and the directional G-source. A uniform G-source
distributes its light evenly in all directions. An example of a
uniform G-source is a usual light source, whose illuminated
configurations form a hemispherical cone around the
source direction.

Another kind of G-source is a directional G-source, which
projects light depending on the rotation around the light
source direction. An example of a directional source is a
directional edge detector. Note that a detector is also
considered as a G-source. Since the directional edge detector
only detects edges with certain orientations, it is regarded as a
directional source. The illuminated configurations of a
directional source become a thin slice in the configuration space.

Uniform  G-source 

We specify a uniform G-source as

(NS type direction angle)

The first argument, type, specifies what kind of feature the
G-source illuminates, and takes one of the values face, edge,
and vertex. The second argument, direction, denotes the
G-source illumination direction as a vector. The third
argument, angle, defines the illuminated configurations by
specifying the apex angle between the illumination direction
and the z axis of the feature coordinate. If type is face, this
angle defines the maximum allowable angle between the face
surface normal and the illumination direction. If type is edge,
this angle defines the maximum allowable value of the
smaller of the two angles between the illumination
direction and the two z axes of the surfaces incident to the edge.
That is, if either or both faces are well illuminated, then the
edge is considered to be illuminated. If type is vertex, we have to
consider at least three faces incident to the vertex. This angle
defines the maximum allowable value of the smallest of
the angles between the illumination direction and the z axes
of the incident surfaces. That is, if any of the incident faces of
the vertex is illuminated, the vertex is considered to be
illuminated.
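
The three cases of the uniform G-source reduce to one rule: the smallest angle between the illumination direction and the normals of the incident faces must not exceed the given angle. The sketch below illustrates that rule; the function names are hypothetical, and it assumes the direction vector points from the feature toward the source.

```python
import math


def angle_between(u, v):
    """Angle in radians between two nonzero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_angle)


def uniform_source_illuminates(direction, incident_normals, max_angle):
    """Face: one normal. Edge: two normals. Vertex: three or more."""
    if not incident_normals:
        raise ValueError("at least one incident face normal is required")
    smallest = min(angle_between(direction, n) for n in incident_normals)
    return smallest <= max_angle


direction = (0.0, 0.0, 1.0)
limit = math.radians(60.0)
face_lit = uniform_source_illuminates(direction, [(0.0, 0.0, 1.0)], limit)
edge_lit = uniform_source_illuminates(
    direction, [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], limit)
edge_dark = uniform_source_illuminates(
    direction, [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], limit)
print(face_lit, edge_lit, edge_dark)
```

The second edge is not illuminated because both of its incident faces are at 90 degrees to the illumination direction, which exceeds the 60 degree limit.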

Directional G-source

We specify a directional G-source as

(DS type direction angle spec-direction spec-angle)

The first argument, type, specifies one of the object features:
face, edge, and vertex. The second argument, direction,
denotes the G-source illumination direction as a vector. The
third argument, angle, defines the spherical angle of the
illuminated configurations, as for the uniform G-source. The
fourth argument, spec-direction, defines the constraint
direction to be used in the following argument. If the type is face,
the z axis direction is used. If the type is edge, the x axis
direction is used. If the type is vertex, the z axis direction is
used. The fifth argument, spec-angle, defines the maximum
allowable angle between the constraint direction and the
principal direction, i.e., the surface normal of a face, the edge
direction of an edge, and the average surface orientation
around a vertex.

Abstract

An  object  recognition  system  is  presented  to  handle  the  compu¬ 
tational  complexity  posed  by  a  large  model  base,  an  unconstrained 
viewpoint,  and  the  structural  complexity  and  detail  inherent  in  the 
projection  of  an  object.  The  design  is  based  on  two  ideas.  The  first  is 
to  compute  descriptions  of  what  the  objects  should  look  like  in  the  im¬ 
age,  called  predictions,  before  the  recognition  task  begins.  This  reduces 
actual  recognition  to  a  2D  matching  process,  improving  the  efficiency 
of 3D object recognition during run-time.

The  second  design  feature  is  to  represent  all  the  predictions  by 
a  single,  combined  IS-A  and  PART-OF  hierarchy  called  a  prediction 
hierarchy  that  speeds  up  the  indexing,  or  selection,  of  the  correct  object 
and viewpoint. The nodes in this hierarchy are partial descriptions that
are  common  to  views  and  hence  constitute  shared  processing  subgoals 
during  matching.  The  recognition  time  and  storage  demands  of  large 
model  bases  and  complex  models  are  substantially  reduced  by  subgoal 
sharing:  projections  with  similarities  explicitly  share  the  recognition 
and  representation  of  their  common  aspects. 

A  prototype  system  for  the  automatic  compilation  of  a  prediction 
hierarchy  from  a  3D  model  base  is  demonstrated  using  a  set  of  polyhe¬ 
dral  objects  and  projections  from  an  unconstrained  range  of  viewpoints. 
In  addition,  a  recognition  system  is  proposed  that  can  efficiently  search 
for  matches  to  the  object-specific  goal  nodes  of  the  hierarchy. 


1.  Introduction 

Object  recognition  is  a  central  problem  in  computer  vision. 
It  is  also  a  very  difficult  problem,  both  because  of  the  vast  num¬ 
ber  of  potential  objects  that  a  fully  general  vision  system  must 
be  able  to  recognize,  and  because  of  the  multitude  of  appear¬ 
ances  that  each  object  can  present  when  viewed  from  different 
vantage  points.  In  addition,  most  practical  applications  would 
demand  that  object  recognition  be  accomplished  very  rapidly. 
That humans can, at a glance, recognize an object from the tens
of  thousands  of  familiar  objects  [1]  stands  as  a  challenge  to  com¬ 
puter  vision. 

To date, significant progress has been made in model-based
object recognition. (See [3,4,11,12,13] for some of the most recent
efforts, and [6] for a more comprehensive review.) However, none
of this work has fully addressed the problem of rapid recognition
out of large model bases. Indeed, the methods used by most
model-based object-recognition systems, while often reasonably
successful within their own terms, could not be scaled up to deal
with large model bases, that is, those containing thousands of
objects.

We  present  here  an  approach  which  explicitly  addresses  the 
problem  of  large  model  bases.  While  we  do  not  claim  to  have 
fully  solved  this  problem,  our  analysis  and  experience  with  a 
prototype  system  indicate  that  this  approach  is  feasible. 

2.  Overview 

Figure 1 shows a block diagram of the basic recognition system
design. The three main modules (the boxes) are discussed
in the subsections that follow.

2.1  Predictions  and  the  Prediction  Hierarchy  Compiler

The  prediction  hierarchy  compiler  embodies  two  major  as¬ 
pects  to  our  approach.  The  first  is  that  before  recognition,  in 
an  “off-line”  process,  we  precompile  from  the  3D  model  base  de¬ 
scriptions  of  the  potential  appearance  of  all  the  objects  from  all 
possible  vantage  points.  (This  is  actually  something  of  a  concep¬ 
tual  simplification:  what  is  actually  done  will  be  described  later.) 
These  descriptions  constitute  predictions  about  the  2D  image  ap¬ 
pearance  of  the  objects.  By  doing  this  we  avoid,  at  recognition 
time,  the  cost  of  the  complicated  3D-to-2D  transformations  be¬ 
tween  3D  models  and  2D  images.  The  cost  of  these  computations 
in  the  general  case  is  one  of  the  chief  difficulties  with  ACRONYM 
[4].  In  our  approach  the  matching  at  recognition  time  is  reduced 
to  a  2D-to-2D  problem.  This  precomputation  of  predictions  is 
possible  because,  while  an  object  potentially  has  an  infinitude 
of  different  exact  appearances,  only  a  finite  number  of  these  are 
different  in  a  qualitative  sense  [9,14],  and  for  many  objects,  the 
number  of  these  qualitatively  different  views  (“generic  views”) 
is  reasonably  small. 

Even  so,  the  number  of  such  views  over  a  large  model  base  can 
still  be  very  great.  This  leads  to  the  second  component  of  our  ap¬ 
proach,  which  is  to  organize  these  predictions  into  a  hierarchical 
structure  based  on  PART-OF  (aggregation)  and  IS-A  (special¬ 
ization)  links.  This  way,  the  common  aspects  of  the  predictions 
are  made  explicit  and  represented  just  once  in  the  hierarchy.  One 
advantage  of  this  is  that  the  set  of  predictions  can  thereby  be 
stored more compactly. But more important, predictions
common to many objects need be matched only once during the later
recognition step, effecting a considerable economy in matching.
Ideally, the hierarchy should act like an efficient discrimination
net, permitting rapid indexing to an object-specific prediction.
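
The economy gained from shared predictions can be illustrated with a small sketch. It is not the system described in this paper: the class, the feature names, and the use of a cache keyed by node name are assumptions chosen only to show that a prediction shared through PART-OF links is tested once, however many views contain it.

```python
class PredictionNode:
    """A partial description with PART-OF links to shared sub-predictions."""

    def __init__(self, name, test, parts=()):
        self.name = name
        self.test = test            # predicate over the image features
        self.parts = tuple(parts)   # sub-predictions this node contains


def match(node, features, cache, calls):
    """Match a node, reusing results for nodes already evaluated."""
    if node.name in cache:
        return cache[node.name]
    calls[node.name] = calls.get(node.name, 0) + 1
    parts_ok = all(match(p, features, cache, calls) for p in node.parts)
    matched = parts_ok and node.test(features)
    cache[node.name] = matched
    return matched


# One shared sub-prediction used by two hypothetical view predictions.
parallelogram = PredictionNode("parallelogram", lambda f: "parallelogram" in f)
box_view = PredictionNode(
    "box-view", lambda f: "three-way-junction" in f, [parallelogram])
wedge_view = PredictionNode(
    "wedge-view", lambda f: "triangle" in f, [parallelogram])

features = {"parallelogram", "triangle"}
cache, calls = {}, {}
results = {v.name: match(v, features, cache, calls)
           for v in (box_view, wedge_view)}
print(results, calls["parallelogram"])
```

Here the shared parallelogram prediction is evaluated once even though both views depend on it, and only the wedge view matches the extracted features.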

Fig. 1. The basic recognition system design.

In this way our prediction hierarchy provides a single, uniform
mechanism for coping both with variation and similarity across
objects and across views of a single object.

In reality, these two operations, of prediction and organization,
are not run separately, but are intertwined. Predictions are
generated and incorporated into the hierarchy incrementally, as
needed. This has two related advantages. First is that a
potentially unmanageable, unorganized collection of predictions is
never actually created; second is that the predictions generated
need only be elaborated sufficiently to ensure adequate
discrimination between objects (and distinct views). The full, detailed
description of the generic views of the objects normally need not
be generated; only general predictions of appearance sufficient
to adequately discriminate object views one from another are
needed. Indeed, in the system as implemented, the concept of
generic view does not appear at all.

Out of these twin processes of prediction and organization
comes a prediction hierarchy, which encapsulates in a directly
usable form the knowledge needed to visually recognize objects
out of the 3D model base. Notice that, while the construction
of the prediction hierarchy may be a computationally expensive
process, it need be done only a single time for any given model
base. Once the prediction hierarchy is compiled, it can be used
over and over again for recognizing objects out of that model
base.

2.2  Recognition-Time  Matching

The  actual  recognition  of  an  object  in  an  image  involves 
matching  the  hierarchically  organized  predictions  against  per¬ 
ceptual  features  extracted  from  that  image.  In  the  prototype 
system  described  here,  matching  takes  place  systematically  from 
most-general  predictions  (shared  by  many  objects)  to  object- 
view-specific  predictions,  thereby  ensuring  effective  indexing  from 
the  image  data  into  the  model  base. 

2.3  Pose  Estimation  and  Verification 

In  most  applications  of  object  recognition  it  is  necessary  not 
only  to  identify  the  viewed  object,  but  also  to  determine  its  pose 
(position  and  orientation)  with  respect  to  the  camera.  Matching 
using  the  prediction  hierarchy  may  only  determine  the  approxi¬ 
mate  view  of  an  object.  However,  it  does  facilitate  exact  determi¬ 
nation  of  object  pose.  The  correspondence  between  the  matched 
prediction  and  the  image  features  can  be  related  back  through 
the  prediction  to  the  original  3D  object  model,  giving  a  cor¬ 
respondence  between  the  image  features  and  the  object  model. 
From this information can be determined the parameters of the
3D-to-2D  transformation  that  best  matches  the  3D  object  fea¬ 
tures  to  their  corresponding  2D  image  features.  What  is  more, 
the  approximate  view  associated  with  the  matched  prediction 
can  be  used  as  a  good  starting  point  for  iterative  methods  of  solu¬ 
tion  for  the  pose  parameters,  such  as  is  used  in  Lowe’s  SCERPO 
system [13].

The  exact  determination  of  pose  also  provides  verification 
of  the  object  identification  achieved  by  matching  through  the 
prediction  hierarchy.  This  is  important  because  it  means  that 
the  discriminations  made  by  the  prediction  hierarchy  need  not 
be  perfect.  It  is  sufficient  that  matching  through  the  prediction 
hierarchy  be  conservative,  that  it  not  rule  out  any  true  object 
identifications.  Any  objects  and  views  erroneously  indexed  by 
the  prediction  matching  can  be  ruled  out  by  the  pose  estimation 
and  verification.  Of  course,  the  more  false  indexing  ruled  out  by 
the  prediction  matching,  the  less  work  there  is  to  be  done  by  the 
pose  estimation  and  verification,  but  there  is  clearly  a  trade-off 
involved  here.  At  a  certain  point  it  may  be  more  efficient  to  take 
a  matched  general  prediction  and  verify  the  objects  it  implies 
one  by  one  through  pose  verification,  than  to  further  extend  the 
match  to  an  object-specific  prediction.  One  advantage  of  our 
approach  is  that  this  cut-off  point  could  be  determined  during 
the prediction compilation. It is our belief that normally this cut-off
point is quite late, that is, the goal nodes of the prediction
hierarchy  would  be  specific  to  an  object-view,  or  at  most  to  a 
few  object-views. 

Pose estimation for verification is related to Lowe's use of the
viewpoint consistency constraint [13], although he uses it only in
the context of single-object recognition.

2.4  The  Domain 

The domain we have chosen is polyhedral objects with linear
surface markings. Polyhedra form a useful and rich class of objects
for recognition experiments, while their 3D modeling and
rendering are fairly well understood. In particular, the edges
and linear surface markings of polyhedra project into straight-line
segments in an image, and good techniques for extracting
such straight-line features from images are readily available [2,5].
Note, however, that our overall approach is in no way restricted
to polyhedra: in concurrent work at UMass, generating predictions
for curved objects is also being investigated [9,10,14].

The actual objects used to demonstrate the design are shown
in Figure 2. The objects have differences and similarities in various
dimensions and part of the problem is to take advantage
of both. The similarities in their visual structure, such as occurrences
of parallelograms or of certain types of line junctions,
must be utilized by the recognizer to make the search for a match
efficient. The differences in visual structure, such as height-to-width
proportions, must be utilized to discriminate between the
objects. Additionally, the variations in visual appearance caused
by variations in the camera must be taken into account while doing
the structural analysis.

3. Predictions

For the construction of predictions our camera model is so-called
normal perspective, following Brooks [4]. This is equivalent
to orthographic projection with an additional image scaling
factor. It provides a simple but useful approximation to true
perspective projection so long as the camera is not too close to
the object. (Note that the use of normal perspective for predictions
does not preclude using true perspective for the pose
estimation and verification step.)

A prediction is a statement concerning some structural aspect
of the image of an object. For example, this may be as simple
and general as an assertion that there exists a pair of parallel
segments in the projection; or as complex as a description of an
image unique to some object. A prediction is represented here
as a relational graph; the elements in the graph are projected
straight-line segments. The relations associated with arcs in the
graph mutually constrain the relative orientations, positions and
sizes of pairs of line segments. For example, the relation parallel
constrains two segments to having the same orientation. The
predictions can be viewed as conjunctions of these relations over
formal segment arguments.

The objects are expected in any pose, hence all predictions
generated are invariant to re-scaling, translation and rotation
in the image plane. Because of this and the type of projection
assumed, there are only two degrees of camera variation that
have a significant effect on the projections being described by
the predictions: the two angular components of the viewpoint
that sweep out a viewing sphere about the object.

Any given prediction may be satisfied by the projection of
more than one object in the model base and, in fact, it may be
satisfied by the projection of more than one set of line segments
on the same object. This is especially true for simple structural
statements, such as the parallelogram prediction in Figure 3. For
each such occurrence, or instance, of a prediction, there may be
a wide range of viewpoints for which the set of projected line
segments satisfy the prediction. For example, the projections
of the 3D line segments a, b, c, and d of the triangular prism
model (Figure 3) satisfy the parallelogram prediction over all the
viewpoints for which these segments are simultaneously visible
(half the viewing sphere).

More specifically, we define a prediction instance as a one-to-one
mapping from a set of projected 3D model segments to
the elements of the prediction’s relational graph and the set of
viewpoints (the view region) on the viewing sphere from which
the prediction is valid for these segment-element bindings. For a
given model base, there is a set of such instances associated with
each prediction. During the compilation of the prediction hierarchy,
it is important to keep track of all instances of a prediction
and their associated view regions.

Fig. 3. Example of a prediction instance: (a) the
prediction’s relational graph (two arc types: "coincident
endpoints" and "parallel"); (b) 3D model segments;
(c) mapping between prediction elements and projected
model segments; and (d) range of views for which mapping
satisfies prediction (shaded region).

min_u < u < max_u
min_v < v < max_v
min_a < a < max_a
min_s < s < max_s

Fig. 4. Representation of image segment relations.

A relation between a pair of projected segments used in the
predictions is represented by ranges of four relational measures,
u, v, a and s (see Figure 4). Let us consider two segments, s_i
and s_j. The vector (u, v) is the position of segment s_j relative to
s_i: it is the displacement of an endpoint of s_j from an endpoint
of s_i measured along s_i and normal to s_i, divided by the length
of s_i. The angle between them, a, is measured counterclockwise
from s_i to s_j; and s is the relative scale or length ratio of s_j over
s_i. A relation is defined as an extent box, i.e. intervals in u, v, a
and s, in order to capture the variation over ranges in viewpoint.
For instance, projecting a pair of parallel object segments over all
possible viewpoints will generate a set of measurements that have
a single value (zero) in the a dimension, but some non-trivial
extent in the others. Similarly, the family of projections of a pair
of object segments that share an endpoint can be represented by
a relation that has the value zero in both position components
(u, v).
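
The four relational measures can be illustrated with a short sketch. This is not the authors' implementation: the function names, the choice of the first endpoint of each segment as the reference endpoint, and the angle range (-pi, pi] are assumptions made only for illustration.

```python
import math

def relation_measures(si, sj):
    """Return (u, v, a, s) of segment sj relative to segment si.

    Each segment is ((x0, y0), (x1, y1)). Assumption: the displacement is
    taken between the first endpoints of the two segments.
    """
    (ax, ay), (bx, by) = si
    (cx, cy), (dx, dy) = sj
    ix, iy = bx - ax, by - ay            # direction of si
    jx, jy = dx - cx, dy - cy            # direction of sj
    len_i = math.hypot(ix, iy)
    len_j = math.hypot(jx, jy)
    tx, ty = ix / len_i, iy / len_i      # unit vector along si
    nx, ny = -ty, tx                     # unit normal to si (counterclockwise)
    ex, ey = cx - ax, cy - ay            # endpoint displacement
    u = (ex * tx + ey * ty) / len_i      # along si, in units of |si|
    v = (ex * nx + ey * ny) / len_i      # normal to si, in units of |si|
    a = math.atan2(ix * jy - iy * jx, ix * jx + iy * jy)  # ccw angle si -> sj
    s = len_j / len_i                    # length ratio of sj over si
    return u, v, a, s

def in_extent_box(measures, box):
    """box is four (low, high) intervals, one each for u, v, a, s."""
    return all(lo <= m <= hi for m, (lo, hi) in zip(measures, box))

def similarity(seg, scale, angle, dx, dy):
    """Scale, rotate and translate a segment in the image plane."""
    c, sn = math.cos(angle), math.sin(angle)
    return tuple((scale * (c * x - sn * y) + dx,
                  scale * (sn * x + c * y) + dy) for x, y in seg)
```

Because u and v are normalized by the length of s_i and expressed in its own frame, applying `similarity` to both segments leaves all four measures unchanged, which is the invariance to re-scaling, translation and rotation in the image plane that the predictions require. A pair of parallel segments gives a = 0, and a pair sharing the reference endpoint gives u = v = 0.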

A relation between projected segments is considered useful
if it is valid over a wide range in viewpoints and its extent box
is small in volume (for example, consider the two view-invariant
relations just mentioned). The latter property is important if the
relation is to help in the indexing process; that is, to characterize
an object’s projection with a specificity sufficient to discriminate
the object from a large number of other objects and from chance
arrangements of image segments. Although invariant relations
are clearly useful [13], alone they are not in general sufficient
to fully characterize projections. For instance, proportions are
often strong characterizations of object structure, but the length
measurement ratios that represent them are often not strictly
view invariant. For example, the tall box in Figure 2 has a
height to width ratio that is significantly different from the cube
over a large range in views and no other property can be used
to differentiate them.

Any  given  prediction  in  the  prediction  hierarchy  has  an  asso¬ 
ciated  relational  graph,  but  explicitly  it  is  almost  always  repre¬ 
sented  as  some  combination  or  specialization  of  other  predictions 
(see  Figure  5).  A  prediction  is  a  specialization  of  another  if  it 
can  be  described  by  adding  new  relations  or  narrowing  the  ex¬ 
tent  boxes  of  existing  ones  (i.e.,  tightening  the  constraints).  For 
example,  the  rhombus  can  be  described  by  adding  a  relation 
between  the  bottom  and  side  segments  of  the  more  general  par¬ 
allelogram  prediction  that  constrains  the  ratio  of  their  length 
to  be  one.  A  prediction  is  a  combination  of  other  predictions  if 
it  can  be  described  as  a  conjunction  of  these  other  predictions. 
Predictions may be combined in various ways, depending on the
mappings between the relational graph elements of the part predictions
and those of the whole combination. Figure 6 shows an
example of how the nature of the combination prediction can
change by varying these part-whole maps, which are represented
by the arguments passed into the part predictions and the ordering
of these arguments.

(define parallelogram (s1 s2 s3 s4)
  (and (coincident s1-head s2-head)
       (coincident s2-tail s3-head)
       (coincident s1-tail s4-head)
       (coincident s3-tail s4-tail)
       (parallel s1-tail s3-tail)
       (parallel s2-tail s4-tail)))

(define triangle (s1 s2 s3)
  (and (coincident s1-head s2-tail)
       (coincident s2-head s3-tail)
       (coincident s3-head s1-tail)))

(define triangular-prism (s1 s2 .. s8)
  (and (parallelogram s1 s2 s3 s4)
       (parallelogram s5 s2 s6 s7)
       (triangle s3 s7 s8)))

(define rhombus (s1 s2 s3 s4)
  (and (parallelogram s1 s2 s3 s4)
       (same-size 1 2)))

Fig. 5. Predictions as specializations or combinations
of other predictions. (Multiple links between nodes not
shown.)
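
The effect of the part-whole maps can be sketched in a few lines. The representation below (relations as tuples over formal element numbers, and the names `instantiate`, `combine` and `elements`) is an assumption chosen for illustration, not the data structure of the system described here; the U-shape definition and the two argument lists are taken from Figure 6.

```python
# U-shape over formal elements 1, 2, 3 (Figure 6a).
U_SHAPE = [("coincident", 1, 2), ("coincident", 2, 3), ("parallel", 1, 3)]

def instantiate(relations, args):
    """Rename formal elements 1..n of a part prediction to the elements
    `args` of the whole (this argument list is the part-whole map)."""
    return {(name, args[a - 1], args[b - 1]) for name, a, b in relations}

def combine(*parts):
    """Conjunction of part predictions; each part is (relations, args)."""
    whole = set()
    for relations, args in parts:
        whole |= instantiate(relations, args)
    return whole

def elements(relations):
    """Set of graph elements mentioned by a relation set."""
    return {e for _, a, b in relations for e in (a, b)}

# Same two parts, different part-whole maps, different combinations.
share_two = combine((U_SHAPE, (1, 2, 3)), (U_SHAPE, (2, 3, 4)))
share_one = combine((U_SHAPE, (1, 2, 3)), (U_SHAPE, (3, 4, 5)))
```

With the first map the two U-shapes share two elements and the whole has four elements; with the second they share one element and the whole has five. The combination is determined by the argument lists alone, which is why the ordering of the arguments matters.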

Note  that,  because  of  the  combination  predictions,  the  “hier¬ 
archy”  is  actually  a  directed  acyclic  graph,  where  the  predictions 
are nodes, and the specializations and combinations are successors
of the nodes they are created from (their predecessors).
For  the  prototype  system  presented  in  this  paper,  the  hierar¬ 
chy  compiler  and  matcher  have  only  been  developed  to  use  the 
combination  process.  A  compiler  that  uses  specialization  will  be 
presented in forthcoming work [8]. Also, the compiler currently
generates  combinations  that  are  limited  to  pairs  of  parts. 

4.  Prediction  Hierarchy  Compilation 

In  Section  3  it  was  shown  that  a  prediction  hierarchy  is  com¬ 
posed  of  predictions  that  are  combinations  or  specializations  of 
other predictions. The primitive, or base-level predictions of the
hierarchy (those that all the other predictions are derived from)
ought to be those that are valid for very large numbers of objects
and are essentially invariant to viewpoint; such predictions are
like the invariant binary relations discussed above and in [13].
The final or goal predictions ought to be valid for only a few
objects and over a narrow enough range of viewpoints that pose
estimate refinement and verification can be performed effectively.

(a) (and (coincident 1 2)
         (coincident 2 3)
         (parallel 1 3))

(b) (and (U-shape 1 2 3)
         (U-shape 2 3 4))

    (and (U-shape 1 2 3)
         (U-shape 3 4 5))

    (and (U-shape 1 2 3) (U-shape 4 3 5)
         (U-shape 1 4 3))

Fig. 6. An example of how different part-whole maps
can create different combinations: (a) U-shape prediction;
(b) predictions created by the combination of U-shape
with itself under different part-whole maps. These
are represented by arguments passed into the predecessor
prediction (U-shape) and the ordering of these arguments.

A  useful  way  to  build  such  a  structure  is  to  start  with  a  small 
set of simple, ubiquitous predictions and then iteratively search
for  useful  combinations  or  specializations,  eventually  isolating 
predictions  that  characterize  the  projections  of  specific  objects. 
This  builds  the  hierarchy  in  the  direction  in  which  the  index¬ 
ing  process  proceeds  through  it,  generating  successors  that  seem 
most  useful  for  discriminating  the  possibilities  associated  with 
each  current  node. 

Though  the  major  concern  of  this  work  is  the  efficiency  of  the 
recognition  process,  the  pre-recognition  compiler  design  should 
also  avoid  combinatorial  explosions  that,  for  a  fair-sized  model 
base,  would  keep  it  running  until  the  next  century.  By  using  this 
iterative  construction  approach,  we  limit  the  combinatorial  com¬ 
plexity,  and  hence  processing  time,  to  a  manageable  level.  This 
is  because  the  system  is  never  performing  isomorphism  analysis 
over  large  graphs:  it  is  always  comparing  combinations  of  small 
numbers  of  parts. 

4.1  Criteria  for  Selecting  Successors 

In  building  the  prediction  hierarchy,  it  is  important  to  under¬ 
stand  which  of  the  possible  combinations  and  specializations  of 
simpler  predictions  are  useful  to  explicitly  represent  as  nodes  in 
the  hierarchy.  For  maximum  efficiency  in  the  search  for  the  exact 
object  and  view,  the  hierarchy  should  be  in  the  form  of  a  binary 
search  structure.  To  achieve  this,  each  new  node  added  by  the 
compiler  should  be  satisfied  by  exactly  half  of  the  instances  of  its 
predecessors  (where  instances  are  the  possible  segment-element 
maps  of  a  prediction  described  in  Section  3).  This  would  make 
the  new  node  an  ideal  test  in  a  binary  search  process. 

This  is  roughly  what  the  compiler  strives  to  build,  though 
there  is  an  additional  consideration  that  changes  the  picture 
somewhat.  The  absence  of  a  match  to  a  prediction  is  not  con¬ 
sidered  here  as  evidence  for  the  presence  of  alternative  objects 
or  views.  For  each  alternative  subset  of  possibilities  (instances), 
there  should  be  an  associated  successor  that  requires  additional 
positive  evidence  in  the  image  in  order  to  be  satisfied,  i.e.,  some 
new  image  part  that  needs  to  be  present.  This  is  because  the 
absence  of  a  match  could  be  due  to  noise,  instead  of  the  absence 
of  the  object;  and  noise  is  more  likely  to  prevent  matches  than 
to  create  spurious  matches  to  a  non-trivial  relational  graph  de¬ 
scription.  For  example,  if  the  recognizer  has  reached  node  8  in 
Figure  7  and  is  trying  to  decide  whether  the  object  projected  is 
a  triangular  prism  or  something  else,  the  absence  of  the  triangle 
match does not necessarily mean that it is a projection of some
other object, but the presence of an additional parallelogram in
place  of  the  triangle  is  strong  evidence  that  it  is  indeed  another 
object. 

Thus, the compiler does not try to add predictions whose
satisfaction acts as a discrimination test. Instead, the compiler
associates each subset of instances with a distinct combination
successor. The number of successors per prediction (i.e., the
branching factor of the hierarchy) should be kept low since large
branching factors can slow the match search process down. In
addition, it is important that every instance of a prediction be associated
with some successor of this prediction. Whenever there
is  a  match  to  a  non-goal  prediction,  there  should  also  be  a  match 
to  at  least  one  of  its  successors,  so  that  there  can  be  some  path, 
via  a  sequence  of  matches,  that  culminates  in  a  goal  prediction 
match.  An  instance  is  covered  if  there  is  a  successor  that  is  also 
satisfied by the same object, same view and compatible
segment-element  bindings  (given  the  part-whole  maps  relating 
predecessor  elements  to  successor  elements). 

In  the  ideal  case  then,  there  would  be  two  successors  per 
node,  each  with  positive  evidence  (such  as  the  triangle  or  par¬ 
allelogram  above),  and  together  they  partition  the  prediction 
instances exactly in half. This ideal case cannot always be met:
there  may  not  be  successors  that  are  satisfied  by  half  of  the  in¬ 
stances,  but  only  by  somewhat  smaller  or  larger  fractions;  and 
there  may  not  be  a  set  of  successors  that  partition  the  instances 
into  disjoint  subsets.  Nevertheless,  at  each  iteration  the  com¬ 
piler  should  try  to  select  as  small  a  set  of  successors  as  possible 
that  still  cover  all  the  instances  of  the  predictions  generated  the 
iteration  before. 

4.2  Iterative  Hierarchy  Construction 

For  the  experiment  reported  here,  parallelism  and  endpoint 
coincidence  are  used  as  the  initial  set  of  primitive  predictions 
since  they  occur  frequently  in  the  objects  studied  here,  they  can 
be  rapidly  matched  and  their  combinations  can  be  used  to  dis¬ 
criminate  between  most  of  the  objects  and  views.  The  iterative 
construction process stops when all of the predictions without
successors are either associated with single objects or, if they
are satisfied by projections of more than one object, there are no
differences between their projections that the compiler is capable
of representing.

For  each  iteration  of  the  combination  process,  the  system 
isolates  useful  combinations  by  (1)  finding  predictions  that  often 
appear  together  in  the  same  projections,  and  then  (2)  selecting  a 
subset of these commonly occurring combinations that approximately
satisfy the criteria discussed in Section 4.1.

The  co-occurrences  between  predictions  are  found  by  noting 
the  amount  of  overlap  in  the  view  regions  of  their  instances.  All 
instances  of  all  predictions  are  stored  in  data  structures  called 
visibility  maps.  There  is  a  visibility  map  for  each  object;  the 
maps  are  arrays  of  cells  indexed  by  the  two  viewpoint  parame¬ 
ters,  making  a  discrete  sampling  of  the  view  sphere  about  the 
object.  Each  cell  lists  prediction  instances  visible  from  the  as¬ 
sociated  viewpoint  and  object;  and  with  each  prediction  is  a  list 
of cells that contain it. To find frequent co-occurrences between
some  prediction  p  and  other  predictions,  the  system  looks  for 
predictions  that  appear  in  the  same  cells  as  p  and  accumulates 
the  total  number  of  cells  for  each  that  do. 
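
The co-occurrence count over visibility maps can be sketched as follows. This is a simplified illustration under stated assumptions: each cell holds prediction names rather than full prediction instances with segment bindings, cells are keyed by a pair of integer viewpoint indices, and the names `build_index` and `co_occurrence` are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(visibility_maps):
    """visibility_maps: {object: {(i, j) viewpoint cell: set of predictions}}.

    Returns the reverse index described in the text: for each prediction,
    the set of (object, cell) pairs that contain it.
    """
    cells_of = defaultdict(set)
    for obj, cells in visibility_maps.items():
        for view, preds in cells.items():
            for p in preds:
                cells_of[p].add((obj, view))
    return cells_of

def co_occurrence(p, visibility_maps, cells_of):
    """Count, for every other prediction, the cells it shares with p."""
    counts = Counter()
    for obj, view in cells_of[p]:
        for q in visibility_maps[obj][view]:
            if q != p:
                counts[q] += 1
    return counts

# Toy data (hypothetical objects and view cells).
maps = {
    "prism": {(0, 0): {"parallelogram", "triangle"},
              (0, 1): {"parallelogram", "triangle"},
              (1, 0): {"parallelogram"}},
    "cube":  {(0, 0): {"parallelogram", "U-shape"}},
}
index = build_index(maps)
counts = co_occurrence("parallelogram", maps, index)
```

Only the cells that actually contain p are visited, so the cost of the query grows with the size of p's view regions rather than with the size of the whole view sphere.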

Besides visual co-occurrence, an additional constraint is added
to the pairs of predictions considered for combination. In order
for the resulting successor prediction to be a spatially compact
and connected structure (instead of corresponding to widely separated
parts of the projection of the object), only pairs that share
at least one projected line segment are considered. For example,
the two parallelogram parts of the triangular prism projection in
Figure 5 share a segment.

After finding all commonly occurring combinations at the
current level, the compiler tries to select a small subset that collectively
and efficiently covers the instances of the predictions that
they are successors of. The successors are iteratively selected in
the following way:

1. Select the successor that covers the largest number of previously
uncovered instances.

2. Update the records of all the instances just covered by this
new successor.

Note  that  this  iterative  selection  process  is  an  inner  loop  to 
the  iterative  process  that  is  building  up  the  prediction  hierarchy 
level  by  level.  This  inner  loop  works  within  a  level  to  select  a 
useful  set  of  successors  at  that  level. 
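
The two-step selection loop is the familiar greedy heuristic for set cover. A minimal sketch follows, assuming each candidate successor is given as the set of predecessor instances it covers; the stopping rule (stop when everything is covered or no candidate adds coverage) and the name `select_successors` are assumptions, since the text states only the two steps.

```python
def select_successors(candidates, instances):
    """Greedy selection of a small covering set of successors.

    candidates: {successor name: set of instances it covers}
    instances:  iterable of all instances that should be covered
    Returns (chosen successors in order, instances left uncovered).
    """
    uncovered = set(instances)
    chosen = []
    while uncovered:
        # Step 1: successor covering the most previously uncovered instances.
        best = max(candidates, key=lambda c: len(candidates[c] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # remaining instances cannot be covered by any candidate
        chosen.append(best)
        # Step 2: record the instances just covered.
        uncovered -= gained
    return chosen, uncovered

# Toy data (hypothetical successors and instance ids).
cands = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}, "D": {1}}
chosen, left = select_successors(cands, range(1, 7))
```

The greedy choice does not guarantee the smallest possible set of successors, but it keeps the branching factor low at modest cost, which matches the stated goal of selecting "as small a set of successors as possible" only approximately.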

4.3  Example  Results 

The prediction hierarchy compiler was run on the model base
of polyhedral objects shown in Figure 2. Specialization was used
in only a rudimentary way, to distinguish the tall box from the
cube at the very end. All the other predictions were formed by
combination.

The  resulting  hierarchy  is  shown  diagrammatically  in  Fig¬ 
ure  7.  As  can  be  seen,  it  is  a  quite  reasonable  representation 
of  visual  knowledge  about  these  objects,  organized  for  efficient 
recognition. Other prediction hierarchies are also possible given
the objects modeled. For example, node 6 of Figure 7 could alternatively
have been the combination of a triangle with a parallelogram,
rather than the U-shaped structure. This might actually
be preferred for greater noise insensitivity, since the parallelogram
could afford to lose more line segments to noise and still
distinguish the triangular prism from the other objects. Since
the compiler is a single-pass algorithm, only combining predictions
that have already been generated, it noticed the triangle
and U-shape combination before the parallelogram was
constructed. It may be possible to add a post-processing step
that realizes more intelligent combinations than can be isolated
in a single-pass system.

Fig. 7. The resulting prediction hierarchy compiled
from views of the objects in Figure 2. The nodes represent
predictions and arrows indicate combination and specialization
links. The predictions are represented graphically
by segments and dashed lines for parallel relations.

5.  Matching 

This  section  is  a  brief  summary  of  the  matching  system  cur¬ 
rently  under  development.  The  system  tries  to  find  matches  to 
the  object-specific  goal  predictions  given  a  prediction  hierarchy 
and  matches  to  the  primitive,  base-level  predictions.  Starting 
with  these  readily  found  base-level  matches,  the  basic  idea  is  to 
repeatedly  attempt  to  extend  existing  matches  into  matches  of 
more  complex  successor  nodes  until  good  matches  to  goal  nodes 
are  achieved. 

The  following  problems  are  being  investigated  in  the  process 
of  designing  this  system: 

1.  Missing  straight-line  segments  in  the  image,  or  those  “not 
seen” by the matcher because they are mis-measured, can
block  match  extensions  that  are  steps  towards  a  goal  node 
match. 

2.  For  a  given  match,  there  may  be  a  large  tree  of  direct 
and indirect successors of the node matched. The matcher
should avoid exhaustively searching this tree for matches
to successors.

3. Because predictions are often represented as combinations
of predecessors that may themselves be large and complex,
the attempt to extend a match to a successor match can
involve a costly search for the other predecessor. These
searches should be avoided, or methods of reducing their
cost should be implemented.

4.  Image  clutter  can  generate  a  lot  of  possible  match  exten¬ 
sions,  potentially  slowing  down  the  search  for  goal  matches. 

5.  Symmetries  in  a  node’s  relational  graph  description  can 
slow  the  search  for  matches  to  its  successors.  There  may 
be  many  possible  locations  to  check  for  the  match  of  the 
other  predecessor.  There  also  can  be  ambiguities  to  resolve 
in  the  image-successor  match  bindings,  given  the  matches 
to  the  predecessors. 

We  believe  that  a  system  of  the  following  basic  design  can 
handle  these  problems.  It  is  a  set  of  asynchronous  matching 
processes  that  are  distributed  spatially  about  the  image.  Each 
process  repeatedly  attempts  to  jointly  extend  pairs  of  existing 
matches  whose  image  segments  lie  in  a  spatial  zone  associated 
with  the  process.  The  extension  operation  has  the  following 
steps: 

• Select a pair of existing matches whose image segments
have some non-accidental relationship, such as those discussed
in [13], or matches that actually share an image segment.
Segment sharing can be considered a certain kind
of non-accidental relationship: the identity relation.


•  Search  the  prediction  hierarchy  for  common  -  and  possibly 
indirect  -  successors  to  the  pair  of  nodes  matched.  This  is 
done using a marker passing and collision detection scheme.

•  Attempt  to  complete  a  match  to  the  common  successor 
given  the  pair  of  predecessor  matches  and  the  part-whole 
maps  (see  Section  3)  between  their  relational  graph  ele¬ 
ments.  The  completion  process  may  involve  a  search  for 
particular  image  segments  if  the  common  successor  is  in¬ 
direct. 
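
The second step, finding common successors of two matched nodes, can be sketched as two breadth-first marker propagations whose collisions are the shared successors. The text does not specify the marker passing scheme, so everything below is an assumption: the adjacency-list representation of the hierarchy, the function names, and the ranking of collisions by combined distance. The toy node names are hypothetical.

```python
from collections import deque

def reachable_successors(dag, start):
    """Pass a marker breadth-first from `start`; return {node: distance}
    for all direct and indirect successors."""
    dist = {}
    queue = deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in dag.get(node, ()):
            if nxt not in dist:
                dist[nxt] = d + 1
                queue.append((nxt, d + 1))
    return dist

def common_successors(dag, a, b):
    """Nodes where the markers from a and b collide, nearest first."""
    da, db = reachable_successors(dag, a), reachable_successors(dag, b)
    shared = set(da) & set(db)
    return sorted(shared, key=lambda n: (da[n] + db[n], n))

# Toy prediction hierarchy: node -> list of direct successors.
hierarchy = {
    "coincident": ["U-shape", "triangle"],
    "parallel": ["U-shape"],
    "U-shape": ["parallelogram"],
    "triangle": ["prism-view"],
    "parallelogram": ["prism-view"],
}
near = common_successors(hierarchy, "coincident", "parallel")
far = common_successors(hierarchy, "parallel", "triangle")
```

Ranking collisions by combined distance tries the most direct common successor first while still exposing indirect ones, which the completion step may need when segments are missing.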

For  typical  images,  a  number  of  the  expected  straight-line 
segments  are  missing  or  mis-measured.  In  spite  of  this  incom¬ 
plete  information,  the  objects  are  often  readily  recognizable.  For 
a  recognition  system  to  detect  objects  in  such  data,  it  must  have 
some  ability  to  find  and  accept  good  partial  matches,  where  the 
quality  of  the  partial  match  is  some  function  of  the  fraction  of 
well-matched  segments.  Because  the  absence  of  segments  may 
prevent  a  good  partial  match  of  a  direct  successor,  but  not  to 
an  indirect  successor,  the  system  should  be  capable  of  extending 
a node’s match to one of its indirect successors. For example,
extending the match of A in Figure 8 to a satisfactory match to
B may be impossible; but a match to D, an indirect successor
of A, may be good enough in spite of the missing segments.

(Panels: match to A; match to D.)

Fig. 8. Example of an image with line-segments absent
in such a way that an indirect successor (D) of a
node (A) has a better match than the direct successor between
them (B) given that match quality is in terms of
the fraction of the prediction elements matched. Matched
image segments are shown in bold.
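
The situation in Figure 8 can be made concrete with a small sketch in which match quality is taken to be exactly the fraction of prediction elements matched. The element counts below are hypothetical, not read from the figure, and `match_quality` is an illustrative name.

```python
def match_quality(prediction_elements, matched_elements):
    """Fraction of a prediction's elements that were matched in the image."""
    elements = set(prediction_elements)
    return len(elements & set(matched_elements)) / len(elements)

# Hypothetical predictions: B extends A, and D extends B.
A = {"a1", "a2", "a3", "a4"}
B = A | {"b1", "b2"}
D = B | {"d1", "d2", "d3", "d4", "d5", "d6"}

# Segments found in the image: everything in D except the two that B adds.
found = D - {"b1", "b2"}

quality_B = match_quality(B, found)   # 4 of 6 elements
quality_D = match_quality(D, found)   # 10 of 12 elements
```

Here the two segments that B adds to A are both missing, so the match to B is weak, while D gains enough additional matched segments that its fraction is higher. A matcher that stops at B would never reach the better partial match to D.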

Each  node  may  have  a  potentially  large  tree  of  successors, 
and  matching  to  a  successor  can  be  involved.  Thus  the  attempt 
to  find  matches  to  direct  or  indirect  successors  should  not  be 
an  exhaustive  search  of  the  tree  of  possible  successors.  This  is 
where the strategy of attempting to jointly extend a pair of existing
matches helps: it focuses the search to intersections of the
successor  trees  of  the  matched  nodes.  These  intersections  tend  to 
involve  much  smaller  numbers  of  successors,  and  their  matches 
are  more  promising  since  two  predecessors  have  already  been 
matched.  Joint  extension  of  pairs  of  matches  is  especially  useful 
when  the  nodes  have  symmetries.  In  these  cases,  a  thorough 
search  for  a  successor,  given  a  match  to  only  one  predecessor, 
may  involve  checking  many  image  locations  for  the  match  of  I  he 
other,  missing  predecessor. 
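The focusing effect of joint extension can be illustrated with a small sketch. It assumes the hierarchy is available as a mapping from each node to its direct successors; the function names and the tiny example hierarchy are hypothetical and are not taken from the system described here.

```python
def all_successors(graph, node):
    """Return every direct or indirect successor of `node`."""
    seen = set()
    stack = list(graph.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, ()))
    return seen


def joint_extension_candidates(graph, first, second):
    """Successors reachable from both matched nodes (tree intersection)."""
    return all_successors(graph, first) & all_successors(graph, second)


# Hypothetical hierarchy: A and C share the direct successor B,
# which in turn leads to the indirect successor D.
hierarchy = {"A": ["B", "E"], "C": ["B"], "B": ["D"], "E": [], "D": []}
candidates = joint_extension_candidates(hierarchy, "A", "C")
```

Only the nodes in `candidates` need to be considered when the matches to A and C are extended jointly, which is typically a much smaller set than either successor tree alone.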

Pursuing joint extensions of pairs of matches does help reduce
the amount of effort the matcher spends in the costly search for
complex predecessor matches; but, for the system to rely on joint
extension, pairs of jointly-extendible matches have to be
generated. Also, both members of such a pair should be generated
within a reasonable time frame (i.e., there should not be a long
delay between their creations). By following a simple depth-first
policy, the matcher may often attempt a large number of
alternative matches to roughly the same image data, and thus delay
the generation of matches over other image data. This is
undesirable because predecessors of a node tend to represent parts
of the node's relational graph description that are spatially
distributed. In other words, jointly-extendible matches may share
some image segments, but they tend to occupy different portions
of the image. Thus, if matches are being attempted in different
parts of the image at roughly the same time and level in the
hierarchy, there is a better chance for jointly-extendible pairs of
matches to occur and hence a more rapid convergence to a goal
match. This can be done by explicitly allocating the system
resources to matching in distinct spatial zones distributed about
the image. This should be done whether or not the hardware is
capable of doing these matching operations in parallel.

A fair number of matches to intermediate nodes may be
generated in a complex, cluttered image during the pursuit of
matches to goal nodes, and thus the number of possible pairs
of matches to which joint extension can be attempted could be
large. The answer is to restrict the joint extensions to pairs that
are much more likely to be part of the same object's projection.
In complex images, this could dramatically reduce the number
of pairs attempted, and these pairs have a much greater chance
of being extended. A good heuristic for doing this is to select
pairs of matches whose image segments have what Lowe [13]
describes as associations that are not likely to occur by accident.
For example, the matches of A and C in Figure 8 share an image
segment, which almost always occurs when the matches are to
projections of the same object.
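One simple instance of such a heuristic is the shared-segment test from the example. The sketch below assumes each match is represented by the set of image-segment identifiers it uses; that representation, the function name, and the `min_shared` parameter are illustrative assumptions only.

```python
from itertools import combinations


def likely_same_object_pairs(matches, min_shared=1):
    """Return pairs of match names that share at least `min_shared` segments."""
    pairs = []
    for name_a, name_b in combinations(sorted(matches), 2):
        shared = set(matches[name_a]) & set(matches[name_b])
        if len(shared) >= min_shared:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical matches: A and C share segment 3; X lies elsewhere in the image.
matches = {"A": {1, 2, 3}, "C": {3, 4, 5}, "X": {8, 9}}
pairs = likely_same_object_pairs(matches)
```

Joint extension would then be attempted only for the pairs returned, rather than for every pair of intermediate matches.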


6.  Conclusions 


We  have  described  an  approach  to  rapid  object  recognition 
from large model bases. A structure called a prediction hierarchy
is  used  to  organize  knowledge  about  the  visual  appearance  of 
objects  in  a  form  directly  usable  for  rapid  indexing  of  the  model 


and general view. We present a prototype implementation of the
prediction  hierarchy  compiler  and  show  its  results  using  a  small 
model  base  of  polyhedral  objects.  In  addition,  a  recognition 
system  is  proposed  and  discussed  that  can  efficiently  search  for 
matches  to  the  object-specific  goal  nodes  of  the  hierarchy. 

Current work [8] includes further development of the recognition
system and a prediction hierarchy compiler capable of
specialization. Future directions for this research include a
multi-pass prediction hierarchy compiler that is more intelligent
in its selection of successor nodes and also the compilation of
sophisticated control strategies into the hierarchy.
ABSTRACT

Due  to  the  important  role  that  digitization  error  plays  in  the 
field  of  computer  vision,  a  careful  analysis  of  its  impact  on  the 
computational  approaches  used  in  the  field  is  necessary.  In  this 
paper  we  develop  the  mathematical  tools  for  the  computation 
of  the  average  error  due  to  quantization.  They  can  be  used  in 
estimating  the  actual  error  occurring  in  the  implementation  of 
a  method.  Also  derived  is  the  analytic  expression  for  the  proba¬ 
bility  density  of  error  distribution  of  a  function  of  an  arbitrarily 
large  number  of  independently  quantized  variables.  The  prob¬ 
ability  of  the  error  of  the  function  to  be  within  a  given  range 
can  thus  be  obtained  accurately.  In  analyzing  the  applicabil¬ 
ity  of  an  approach  one  must  determine  whether  the  approach 
is  capable  of  withstanding  the  quantization  error.  If  not,  then 
regardless  of  the  accuracy  with  which  the  experiments  are  car¬ 
ried  out  the  approach  will  yield  unacceptable  results.  The  tools 
developed  here  can  be  used  in  the  analysis  of  the  applicability 
of  a  given  algorithm,  hence  revealing  the  intrinsic  limitations  of 
the  approach. 

1. INTRODUCTION

When an image is taken, a large scene is mapped onto a plane
of small dimensions. This compression of the scene into a small
plane, together with the fact that the coordinates of the points in
the image plane are quantized, introduces a serious handicap in
the computational approaches used in computer vision.
Quantization of the image plane is necessary, because otherwise the
image cannot be processed by a digital computer. The limited
resolution capabilities of the receiver (which may be interpreted
as the limited number of sensory elements in the digital
computer) require the entire scene to be mapped on an m by n
array of points. Such an array is usually much too small to
adequately represent the world in front of the camera. As a result, a
significant amount of error is introduced into computations that
involve the locations of image points and features.

Spatial  quantization  is  not  the  only  kind  of  digitization  in 
computer  vision.  As  the  number  of  levels  of  brightness  that  the 
receiver  can  distinguish  is  limited,  the  brightness  level  at  a  given 
pixel  in  the  image  plane  is  also  digitized  into,  say,  k  values.  The 
digitization  of  the  brightness  level,  too,  is  a  serious  drawback  for 
reliable  computations  in  many  areas  of  computer  vision. 

Due to the important role that digitization error plays in the
field of computer vision, a careful analysis of its impact on the
computational approaches used in the field is called for. Such
analysis can serve two purposes: a) It helps us learn how much
the results of the computations are affected by this error. b) It
can provide insight on how to reduce the impact of this error.
Note that unlike most other types of error, digitization error
cannot be reduced by performing more accurate experiments.
Rather, it is caused by the inherent limitations of the
instruments used in image processing. Therefore, a careful analysis of
digitization error can reveal the intrinsic limitations of a given
approach or algorithm.

In  analyzing  the  applicability  of  an  approach  (which,  of  course, 
is  assumed  to  be  sound  theoretically),  one  must  know  how  sus¬ 
ceptible  to  error  that  approach  is.  To  evaluate  how  error  prone  a 
given  method  is,  one  must  have  a  realistic  measure  of  the  error. 
A  measure  of  error,  for  example,  may  be  the  maximum  error 
that  can  occur  in  applying  an  approach,  namely  a  worst  case 
analysis  of  the  approach.  In  general,  however,  knowledge  of  the 
maximum  error  is  of  limited  value,  because  frequently  the  actual 
error  that  occurs  in  the  implementation  of  an  approach  is  far  less 
than  the  maximum  error  which  could  have  occurred.  The  max¬ 
imum  error  is  a  gross  exaggeration  of  the  error  and  should  not 
be  used  in  the  evaluation  of  the  applicability  of  a  method  unless 
the  likelihood  of  its  occurrence  is  significant.  In  evaluating  the 
reliability  of  an  approach  in  practice,  a  much  better  measure  of 
error  is  the  average  error. 

Also,  it  is  very  useful  to  know  the  distribution  of  the  error. 
That  is,  in  addition  to  the  average  error,  one  would  like  to  know 
what  is  the  probability  for  the  error  to  be  within  a  given  range. 
For  example,  is  the  probability  of  the  error  being  larger  than  a 
certain  threshold  significant,  or  is  it  negligible? 

2.  FORMULATION  OF  THE  AVERAGE  ERROR 

The maximum quantization error in a quantity $A$ being
quantized is half the size of the quantization unit, $\gamma$. (For
example, in spatial quantization the maximum error is half a pixel.)
That is to say, the actual value, $A_a$, of the quantity $A$ can be
anywhere in the interval $I = [A_q - \gamma/2,\ A_q + \gamma/2]$,
where $A_q$ is the value of $A$ after quantization. Furthermore, the
likelihood of the actual value of $A$, i.e. $A_a$, being at a certain
place within the interval follows a uniform probability density. In
other words, the probability of $A_a$ lying inside the small interval
$dA$ around the point $A$ within $I$ is independent of $A$ and equal
to $dA/\gamma$. Thus, the average value of the digitization error in
$A$ is given by
$$\bar{E} = \frac{1}{\gamma}\int_{-\gamma/2}^{\gamma/2} |x|\,dx = \frac{\gamma}{4}, \qquad x = A - A_q. \tag{1}$$

If $A$ is a function of several quantities, each of which is
quantized, then the situation is more complicated. Let
$A = f(A_1, A_2, \ldots, A_N)$, where the actual value, $A_{ia}$, of the
quantity $A_i$ can be anywhere in the interval
$I_i = [A_{iq} - \gamma_i/2,\ A_{iq} + \gamma_i/2]$ with equal
probability. As before, $A_{iq}$ is the value of $A_i$ after
quantization. To obtain the error in $A$ due to quantization errors in
the $A_i$'s, we consider the Taylor series expansion of $A$, retaining
only the first order terms in the quantization errors of the
independent variables $\Delta A_i = A_i - A_{iq}$:

$$\Delta A \approx \frac{\partial f}{\partial A_1}\,\Delta A_1 + \cdots + \frac{\partial f}{\partial A_N}\,\Delta A_N. \tag{2}$$

This expansion is valid when $\Delta A_i$, or equivalently (since
$|\Delta A_i| \le \gamma_i/2$) the interval $I_i$, is small compared to
$A_{iq}$. When the interval $I_i$ is not small, higher order terms in
the expansion must also be taken into account. However, here we
assume that the linear terms in $\Delta A_i$ are sufficient. This is
the only approximation we make in this paper. Obviously, if higher
order derivatives of $A = f(A_1, \ldots, A_N)$ are small or vanish, then
expression (2) remains valid even if the interval is not small.

The maximum error in $A$, i.e. $E_{max}$, is thus given by

$$E_{max} = \sum_{i=1}^{N} |\alpha_i|\,E_{i,max}, \tag{3}$$

where $E_{i,max} = \gamma_i/2$ is the maximum error in $A_i$, and the
coefficient $\alpha_i = \partial f/\partial A_i$ is evaluated at
$A_{1q}, \ldots, A_{Nq}$.

The most realistic measure of the quantization error is the
average error $\bar{E}$, which is the average of the absolute value
error $E = |\Delta A|$, given by

$$\bar{E} = \frac{1}{\prod_{i=1}^{N}\gamma_i}\int_{-\gamma_1/2}^{\gamma_1/2} dx_1 \cdots \int_{-\gamma_N/2}^{\gamma_N/2} dx_N\,\left|\alpha_1 x_1 + \cdots + \alpha_N x_N\right|, \tag{4}$$

where, similar to (1), $x_i$ stands for $x_i = A_i - A_{iq}$.

Because the evaluation of the average error for an arbitrary
error distribution is not possible, often for convenience the
standard deviation, $\sigma$, is used as the measure of error [1]. The
standard deviation for quantization is defined through

$$\sigma^2 = \sum_{i=1}^{N} \alpha_i^2\,\sigma_i^2, \tag{5}$$

where $\sigma_i$ is the standard deviation of the random variable
$x_i$ with uniform distribution in the interval
$[-\gamma_i/2,\ \gamma_i/2]$. For the very special case when
$N \to \infty$ (so that the Central Limit Theorem absolutely holds) and
all the products $\alpha_i\gamma_i$ are identical, the simple
relationship $\sigma/\bar{E} = \sqrt{\pi/2}$ exists between the average
error and the standard deviation [2]; however, in general no
particular relationship between the two is known. Also, as far as we
know, a derivation of (4) good for quantization and instrumental
error has not appeared in the literature.

The average error integral (4) can be evaluated approximately
using the Central Limit Theorem [3,4]. This approximation works
well only when $N$ is large and all the $\alpha_i$ are of the same
order of magnitude. In the next section we calculate the average
error integral exactly.

From now on, without loss of generality, we replace $\gamma$ by 1. It
will be understood that if the error is due to the digitization of
the intensity level, the limits of the integral will be $\pm 1/2$
units of gray level; if the error is due to spatial quantization,
the limits of the integral will be $\pm 1/2$ pixels.

A simple example illustrating the application of the above
integral is as follows. Consider the horizontal distance $A$ between
two pixels $(A_1, y)$ and $(A_2, y)$ in the image plane. Thus
$A = A_1 - A_2$, which gives $\alpha_1 = 1$ and $\alpha_2 = -1$. Hence
the average error in the distance $A$ is

$$\bar{E} = \int_{-1/2}^{1/2} dx_1 \int_{-1/2}^{1/2} dx_2\,|x_1 - x_2| = \frac{1}{3} = 0.33, \tag{6}$$

while the maximum error is 1 pixel and the standard deviation is
$\sigma = 1/\sqrt{6} = 0.41$ pixel.

3. CALCULATION OF THE AVERAGE ERROR
INTEGRAL

In this section we calculate the average error integral

$$\bar{E}(\alpha) = \int_{-1/2}^{1/2} dx_1 \cdots \int_{-1/2}^{1/2} dx_N\,\left|\alpha_1 x_1 + \cdots + \alpha_N x_N\right| \tag{7}$$

for arbitrary $N$ and $\alpha$. Here $\alpha = (\alpha_1, \ldots, \alpha_N)$.
This integral is independent of the sign of $\alpha_i$. In order to
show this we make the transformation $x_i \to -x_i$ and obtain
$\bar{E}(\ldots, \alpha_i, \ldots) = \bar{E}(\ldots, -\alpha_i, \ldots)$.
Hence, we assume that all $\alpha_i \ge 0$. We order the $\alpha_i$
such that $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_N \ge 0$, with at
least $\alpha_1 \ne 0$; otherwise the integral vanishes.

We would like to remove the absolute value sign in the
integrand of (7); to this end we introduce the Heaviside step function

$$\Theta(t) = \begin{cases} 1, & t > 0 \\ 0, & t < 0. \end{cases} \tag{8}$$

This step function satisfies the identity

$$\Theta(t) + \Theta(-t) = 1 \quad \text{for all } t. \tag{9}$$

With the help of step functions we can express an arbitrary
function $f(|t|)$ as

$$f(|t|) = f(t)\,\Theta(t) + f(-t)\,\Theta(-t). \tag{10}$$

Thus we may write (7) as

$$\bar{E}(\alpha) = \int_{-1/2}^{1/2} dx_1 \cdots \int_{-1/2}^{1/2} dx_N\,\left[A\,\Theta(A) + (-A)\,\Theta(-A)\right], \tag{11}$$

where

$$A = \alpha_1 x_1 + \cdots + \alpha_N x_N. \tag{12}$$

We make the transformation $x_1 \to -x_1, \ldots, x_N \to -x_N$ in
the second term of (11); thus the second term becomes identical
to the first term, since $A \to -A$ as seen from (12), and we obtain
the constrained integral

$$\bar{E}(\alpha) = 2\int_{-1/2}^{1/2} dx_1 \cdots \int_{-1/2}^{1/2} dx_N\,A\,\Theta(A). \tag{13}$$
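The equivalence between the absolute-value form of the average error integral and its constrained form, as well as the value 1/3 quoted for the two-pixel distance example, can be checked numerically. The following is a minimal Monte Carlo sketch under the paper's assumption of independent errors uniform on [-1/2, 1/2]; the function name, sample count, and seed are arbitrary choices made for illustration.

```python
import random


def monte_carlo_average_error(alphas, samples=200_000, seed=0):
    """Estimate mean |A| and 2 * mean(A * Theta(A)) for A = sum(alpha_i * x_i)."""
    rng = random.Random(seed)
    abs_total = 0.0
    constrained_total = 0.0
    for _ in range(samples):
        a = sum(alpha * rng.uniform(-0.5, 0.5) for alpha in alphas)
        abs_total += abs(a)
        if a > 0:  # Theta(A) keeps only positive values of A
            constrained_total += a
    return abs_total / samples, 2.0 * constrained_total / samples


# Horizontal distance between two pixels: alpha = (1, -1).
direct, constrained = monte_carlo_average_error([1.0, -1.0])
```

Both estimates should agree with each other and with 1/3 to within sampling noise, which is of order 0.001 for this sample size.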
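Before we proceed, consider the following integral:

$$J_k = \int_{-1/2}^{1/2} dt\,(t - a)^k\,\Theta(t - a), \tag{14}$$

where $k$ is a non-negative integer. The value of this integral
depends critically on the value of $a$.

(i) If $a > 1/2$ (i.e. the upper limit of the integral), the
constraint $\Theta(t - a)$ is never satisfied for any value of $t$
($-1/2 < t < 1/2$) and the integral $J_k$ vanishes. Hence we may as
well write (14) as

$$J_k = \Theta(\tfrac{1}{2} - a)\int_{-1/2}^{1/2} dt\,(t - a)^k\,\Theta(t - a). \tag{15}$$

(ii) If $-1/2 < a < 1/2$, equation (15) reduces to

$$J_k = \Theta(\tfrac{1}{2} - a)\,\Theta(\tfrac{1}{2} + a)\int_{a}^{1/2} dt\,(t - a)^k = \frac{1}{k+1}\,(\tfrac{1}{2} - a)^{k+1}\,\Theta(\tfrac{1}{2} - a)\,\Theta(\tfrac{1}{2} + a). \tag{16}$$

(iii) If $a < -1/2$, equation (15) reduces to

$$J_k = \Theta(\tfrac{1}{2} - a)\,\Theta(-\tfrac{1}{2} - a)\int_{-1/2}^{1/2} dt\,(t - a)^k = \frac{1}{k+1}\left[(\tfrac{1}{2} - a)^{k+1} - (-\tfrac{1}{2} - a)^{k+1}\right]\Theta(\tfrac{1}{2} - a)\,\Theta(-\tfrac{1}{2} - a). \tag{17}$$

The integral $J_k$ of (15) is thus the sum of (16) and (17), that
is

$$J_k = \frac{1}{k+1}\left\{(\tfrac{1}{2} - a)^{k+1}\left[\Theta(\tfrac{1}{2} + a) + \Theta(-\tfrac{1}{2} - a)\right] - (-\tfrac{1}{2} - a)^{k+1}\,\Theta(-\tfrac{1}{2} - a)\right\}\Theta(\tfrac{1}{2} - a). \tag{18}$$

By using the identity (9) we reduce (18) to

$$J_k = \frac{1}{k+1}\left\{(\tfrac{1}{2} - a)^{k+1} - (-\tfrac{1}{2} - a)^{k+1}\,\Theta(-\tfrac{1}{2} - a)\right\}\Theta(\tfrac{1}{2} - a). \tag{19}$$

We further note that for the second term of (19) the
constraint $\Theta(\tfrac{1}{2} - a)$ is already included in
$\Theta(-\tfrac{1}{2} - a)$ and hence redundant. Therefore, we find the
following result for the integral (14):

$$J_k = \int_{-1/2}^{1/2} dt\,(t - a)^k\,\Theta(t - a) = \frac{1}{k+1}\left[(\tfrac{1}{2} - a)^{k+1}\,\Theta(\tfrac{1}{2} - a) - (-\tfrac{1}{2} - a)^{k+1}\,\Theta(-\tfrac{1}{2} - a)\right]. \tag{20}$$

With repeated use of (20) we can now calculate the integral
$\bar{E}(\alpha)$ given by (13).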
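The closed form (20) is easy to sanity-check against direct numerical integration. The sketch below uses a plain midpoint rule; the function names and the number of integration steps are illustrative choices, not part of the original derivation.

```python
def theta(t):
    """Heaviside step function: 1 for t > 0, otherwise 0."""
    return 1.0 if t > 0 else 0.0


def j_closed_form(k, a):
    """Right-hand side of (20)."""
    upper = 0.5 - a
    lower = -0.5 - a
    return (upper ** (k + 1) * theta(upper)
            - lower ** (k + 1) * theta(lower)) / (k + 1)


def j_numeric(k, a, steps=20_000):
    """Midpoint-rule estimate of the integral of (t - a)^k * Theta(t - a)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = -0.5 + (i + 0.5) * width
        total += (t - a) ** k * theta(t - a)
    return total * width
```

Evaluating both functions for values of `a` below -1/2, inside (-1/2, 1/2), and above 1/2 exercises the three cases (i) to (iii) discussed above.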

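To perform the $x_1$-integral we express $\bar{E}(\alpha)$ in the form

$$\bar{E}(\alpha) = 2\alpha_1\int_{-1/2}^{1/2} dx_2 \cdots \int_{-1/2}^{1/2} dx_N \int_{-1/2}^{1/2} dx_1\,(x_1 + a)\,\Theta(x_1 + a), \tag{21}$$

where

$$a = (\alpha_2 x_2 + \cdots + \alpha_N x_N)/\alpha_1. \tag{22}$$

Now, using (20) with $k = 1$ (and with $a$ replaced by $-a$) we do
the $x_1$-integral and obtain

$$\begin{aligned}
\bar{E}(\alpha) = \frac{1}{2\cdot 2!\,\alpha_1}\int_{-1/2}^{1/2} dx_2 \cdots \int_{-1/2}^{1/2} dx_N\,\Big[
&+(+\alpha_1 + 2\alpha_2 x_2 + \cdots + 2\alpha_N x_N)^2\,\Theta(+\alpha_1 + 2\alpha_2 x_2 + \cdots + 2\alpha_N x_N) \\
&-(-\alpha_1 + 2\alpha_2 x_2 + \cdots + 2\alpha_N x_N)^2\,\Theta(-\alpha_1 + 2\alpha_2 x_2 + \cdots + 2\alpha_N x_N)\Big].
\end{aligned} \tag{23}$$

The $x_2$-integral may be similarly performed. Note that both
terms in (23) have the structure of (20) with $k = 2$, and each
of them yields two terms when the $x_2$-integral is carried out.
Hence, after the $x_2$-integral is done (23) becomes

$$\begin{aligned}
\bar{E}(\alpha) = \frac{1}{2^2\cdot 3!\,\alpha_1\alpha_2}\int_{-1/2}^{1/2} dx_3 \cdots \int_{-1/2}^{1/2} dx_N\,\Big[
&+(+\alpha_1 + \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N)^3\,\Theta(+\alpha_1 + \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N) \\
&-(-\alpha_1 + \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N)^3\,\Theta(-\alpha_1 + \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N) \\
&-(+\alpha_1 - \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N)^3\,\Theta(+\alpha_1 - \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N) \\
&+(-\alpha_1 - \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N)^3\,\Theta(-\alpha_1 - \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N)\Big].
\end{aligned} \tag{24}$$

The remaining integrals, $x_3$ through $x_N$, may be carried out
in a similar manner by using the integral (20) with $k = 3$ through
$k = N$. There will be a proliferation of terms, for each time
another variable is integrated out the number of terms doubles.
Thus, we obtain $2^N$ terms altogether when all the integrals are
done. The final result can be expressed in the compact form

$$\bar{E}(\alpha) = \frac{1}{2^N (N+1)!\,\prod_{i=1}^{N}\alpha_i}\sum_{\sigma_1 = \pm 1}\cdots\sum_{\sigma_N = \pm 1}\left[\Big(\prod_{i=1}^{N}\sigma_i\Big)\Big(\sum_{i=1}^{N}\sigma_i\alpha_i\Big)^{N+1}\Theta\Big(\sum_{i=1}^{N}\sigma_i\alpha_i\Big)\right]. \tag{25}$$

Here we remark that in spite of the presence of the $\alpha_i$'s in
the denominator of (25), this expression remains finite as any
$\alpha_i$ goes to zero (as it should). For the sake of simplicity
assume that $\alpha_N \to 0$. To verify that (25) does not diverge, we
expand its numerator in powers of $\alpha_N$ and retain terms linear in
$\alpha_N$. Doing so we get

$$\sum_{\sigma_N = \pm 1}\sigma_N\Big(\sum_{i=1}^{N}\sigma_i\alpha_i\Big)^{N+1}\Theta\Big(\sum_{i=1}^{N}\sigma_i\alpha_i\Big) \;\to\; 2(N+1)\,\alpha_N\Big(\sum_{i=1}^{N-1}\sigma_i\alpha_i\Big)^{N}\Theta\Big(\sum_{i=1}^{N-1}\sigma_i\alpha_i\Big), \tag{26}$$

and thus (25) reduces to an identical expression with $N \to N - 1$.

In Section 5, we present an efficient algorithm for computing
the expression (25) for the average error.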
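As an illustrative consistency check (a worked case added here, not part of the original derivation), the compact form of the average error can be evaluated by hand for $N = 2$ with $\alpha_1 = \alpha_2 = 1$. Of the four sign combinations, only $\sigma = (+1, +1)$ gives a positive $\Theta$-argument; the other three have arguments $0$, $0$ and $-2$ and contribute nothing:

```latex
\bar{E}(1,1)
  = \frac{1}{2^{2}\,3!\,(1)(1)}
    \Big[(+1)(+1)\,(1+1)^{3}\,\Theta(2)\Big]
  = \frac{8}{24}
  = \frac{1}{3}
```

This agrees with the value 0.33 obtained for the horizontal distance between two pixels, as expected, since the integral does not depend on the signs of the $\alpha_i$.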

4.  ANALYTIC  EXPRESSION  FOR  THE 
DISTRIBUTION  OF  ERROR 

The minimum value that the error

$$E = |A| = |\alpha_1 x_1 + \cdots + \alpha_N x_N| \tag{27}$$

can have is of course zero and the maximum value of the error
is

$$E_{max} = \tfrac{1}{2}(\alpha_1 + \cdots + \alpha_N), \tag{28}$$

corresponding to when all the variables have their maximum
error $x_i = 1/2$ (or $-1/2$). Note that all the $\alpha_i$ are taken
to be positive. In this section we calculate the probability for the
error $E$ falling in the interval $[\eta, \zeta]$ where
$0 \le \eta < \zeta$. We denote this probability by
$P(\eta < E < \zeta)$. The most convenient way of calculating this
probability is to express it as

$$P(\eta < E < \zeta) = P(E > \eta) - P(E > \zeta) \tag{29}$$

and then calculate $P(E > \eta)$, i.e. the probability of the error
$E$ being larger than $\eta$. This probability, $P(E > \eta)$, is the
sum of all the possible combinations of the errors $x_i$ in
individual variables that would yield a collective error $E$ larger
than a certain value $\eta$, which may be expressed as the
constrained integral

$$P(E > \eta) = \int_{-1/2}^{1/2} dx_1 \cdots \int_{-1/2}^{1/2} dx_N\,\Theta(E - \eta). \tag{30}$$

We now calculate this integral for an arbitrary number of
variables $N$ and arbitrary values of parameters $\alpha_i$ and $\eta$.

By using $E = |A|$ from (27) and formula (10) we can write
(30) as

$$P(E > \eta) = \int_{-1/2}^{1/2} dx_1 \cdots \int_{-1/2}^{1/2} dx_N\,\left[\Theta(A - \eta)\,\Theta(A) + \Theta(-A - \eta)\,\Theta(-A)\right]. \tag{31}$$

By making the transformation $x_1 \to -x_1, \ldots, x_N \to -x_N$ in
the second term of (31), and noting that $A \to -A$, this term
becomes identical to the first term and we get

$$P(E > \eta) = 2\int_{-1/2}^{1/2} dx_1 \cdots \int_{-1/2}^{1/2} dx_N\,\Theta(A - \eta)\,\Theta(A). \tag{32}$$

The constraint $A > 0$ can be eliminated since it is contained in
$A > \eta \ge 0$ and is redundant; therefore

$$\begin{aligned}
P(E > \eta) &= 2\int_{-1/2}^{1/2} dx_1 \cdots \int_{-1/2}^{1/2} dx_N\,\Theta(A - \eta) \\
&= 2\int_{-1/2}^{1/2} dx_1 \cdots \int_{-1/2}^{1/2} dx_N\,\Theta(\alpha_1 x_1 + \cdots + \alpha_N x_N - \eta).
\end{aligned} \tag{33}$$

The above integral can be calculated in a manner similar to the
average error integral, $\bar{E}(\alpha)$, with repeated use of the
result (20). The $x_1$-integral can be performed by using (20) with
$k = 0$, after which we obtain

$$\begin{aligned}
P(E > \eta) = \frac{1}{\alpha_1}\int_{-1/2}^{1/2} dx_2 \cdots \int_{-1/2}^{1/2} dx_N\,\Big[
&+(+\alpha_1 + 2\alpha_2 x_2 + \cdots + 2\alpha_N x_N - 2\eta)\,\Theta(+\alpha_1 + 2\alpha_2 x_2 + \cdots + 2\alpha_N x_N - 2\eta) \\
&-(-\alpha_1 + 2\alpha_2 x_2 + \cdots + 2\alpha_N x_N - 2\eta)\,\Theta(-\alpha_1 + 2\alpha_2 x_2 + \cdots + 2\alpha_N x_N - 2\eta)\Big].
\end{aligned} \tag{34}$$

Similarly the $x_2$-integral can be carried out by using (20) with
$k = 1$, which yields

$$\begin{aligned}
P(E > \eta) = \frac{1}{2\cdot 2!\,\alpha_1\alpha_2}\int_{-1/2}^{1/2} dx_3 \cdots \int_{-1/2}^{1/2} dx_N\,\Big[
&+(+\alpha_1 + \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N - 2\eta)^2\,\Theta(+\alpha_1 + \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N - 2\eta) \\
&-(-\alpha_1 + \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N - 2\eta)^2\,\Theta(-\alpha_1 + \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N - 2\eta) \\
&-(+\alpha_1 - \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N - 2\eta)^2\,\Theta(+\alpha_1 - \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N - 2\eta) \\
&+(-\alpha_1 - \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N - 2\eta)^2\,\Theta(-\alpha_1 - \alpha_2 + 2\alpha_3 x_3 + \cdots + 2\alpha_N x_N - 2\eta)\Big].
\end{aligned} \tag{35}$$

The remaining integrals, $x_3$ through $x_N$, can also be performed
by using (20) with $k = 2$ through $k = N - 1$. The final result
can be expressed in the compact form

$$P(E > \eta) = \frac{1}{2^{N-1} N!\,\prod_{i=1}^{N}\alpha_i}\sum_{\sigma_1 = \pm 1}\cdots\sum_{\sigma_N = \pm 1}\left[\Big(\prod_{i=1}^{N}\sigma_i\Big)\Big(-2\eta + \sum_{i=1}^{N}\sigma_i\alpha_i\Big)^{N}\Theta\Big(-2\eta + \sum_{i=1}^{N}\sigma_i\alpha_i\Big)\right]. \tag{36}$$

We make the following remark about the result just obtained:
In spite of the appearance of $\alpha_i$ in the denominator of (36)
this expression remains finite as any $\alpha_i \to 0$. This can be
verified in a similar manner to (26), for as, e.g., $\alpha_N$ goes to
zero an identical expression is obtained with $N$ replaced by $N - 1$.
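A direct, unoptimized evaluation of the compact form (36), together with the difference (29), can be sketched as follows. This is a brute-force illustration over all sign combinations, not the efficient algorithm developed in the later sections; the function names are hypothetical.

```python
from itertools import product
from math import factorial, prod


def prob_error_exceeds(alphas, eta):
    """P(E > eta) from the compact form (36), summing all 2**N sign vectors."""
    a = [abs(x) for x in alphas]  # the result does not depend on the signs
    n = len(a)
    if n == 0 or min(a) == 0:
        raise ValueError("need at least one coefficient, all non-zero")
    if eta < 0:
        raise ValueError("eta must be non-negative")
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        b = -2.0 * eta + sum(s * x for s, x in zip(sigma, a))
        if b > 0:  # Theta(b) discards non-positive arguments
            total += prod(sigma) * b ** n
    return total / (2 ** (n - 1) * factorial(n) * prod(a))


def prob_error_between(alphas, eta, zeta):
    """P(eta < E < zeta) as the difference (29)."""
    return prob_error_exceeds(alphas, eta) - prob_error_exceeds(alphas, zeta)
```

For the two-pixel distance example (both coefficients of magnitude 1), `prob_error_exceeds([1, -1], 0.25)` evaluates to 0.5625, i.e. there is a 56% chance that the error in the distance exceeds a quarter of a pixel.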

5.  AN  ALGORITHM  FOR  THE  COMPUTATION  OF 
THE  AVERAGE  ERROR: 

The formula (25) that we derived for the average error
involves $N$ summations and $2^N$ terms. Of these $2^N$ terms only
those terms contribute whose corresponding $\Theta$-functions have
positive arguments. Other terms must be ignored. A simple
algorithm capturing these points is as follows:

    sum = 0.0
    for i1 = -1, step 2, 1
    begin σ[1] = i1
      for i2 = -1, step 2, 1
      begin σ[2] = i2
        ...
          for iN = -1, step 2, 1
          begin σ[N] = iN
            sign = +1
            B = 0
            for i = 1, step 1, N
            begin
              sign = sign * σ[i]
              B = B + σ[i] * α[i]
            end
            if B > 0 then sum = sum + sign * B^(N+1)
          end
        ...
      end
    end

where $B = \sum_{i=1}^{N}\sigma_i\alpha_i$ is the argument of the
$\Theta$-function. Although for small and fixed values of $N$ this
algorithm may be convenient, when $N$ is large or when it is a
parameter to the program the algorithm exhibits several
shortcomings: i) the length of the code becomes unduly long, as $N$
loops are required; ii) as the number of loops in the algorithm
depends on the value of $N$, the program itself has to be modified
whenever $N$ is changed; iii) the efficiency of the algorithm is far
from optimum. The reason for the last shortcoming is that this
algorithm computes the argument of the $\Theta$-function, $B$, for all
possible combinations of $\sigma$'s, which amounts to $2^N$ times. As
we shall see, it is possible to develop an algorithm that discards
most combinations of $\sigma$'s that yield negative $B$'s, without even
taking them into consideration.
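In a language with a Cartesian-product iterator, shortcomings i) and ii) of the nested-loop formulation disappear, although the exponential cost iii) remains. The following Python sketch is such a brute-force evaluation of (25), including the final division by the normalizing constant; the function name is an illustrative choice.

```python
from itertools import product
from math import factorial, prod


def average_error_bruteforce(alphas):
    """Evaluate formula (25) by summing over all 2**N sign vectors."""
    a = [abs(x) for x in alphas]  # the integral does not depend on the signs
    n = len(a)
    if n == 0 or min(a) == 0:
        raise ValueError("need at least one coefficient, all non-zero")
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        b = sum(s * x for s, x in zip(sigma, a))  # argument of Theta
        if b > 0:
            total += prod(sigma) * b ** (n + 1)
    return total / (2 ** n * factorial(n + 1) * prod(a))
```

For instance, `average_error_bruteforce([1, -1])` reproduces the value 1/3 of the two-pixel distance example, and a single coefficient gives one quarter of its magnitude.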

In the algorithm we just described, at the first iteration $\sigma_1$
through $\sigma_{N-1}$ remain unchanged, e.g. they remain $-1$, while
$\sigma_N$ undergoes its two possible states. In the second iteration
$\sigma_{N-1}$ changes its state and $\sigma_N$ again undergoes its two
states. This indicates that the number of $\sigma$'s that are negative
(and for that matter those that are positive) constantly changes.
That is, for the four combinations mentioned above the numbers of
positives were 0, 1, 1 and 2, respectively.

We now propose a different approach, namely one that keeps
the number of negatives unchanged as long as possible. To this
end, we will change the state of any $\sigma$ as frequently as
necessary. More precisely, after considering the configuration where
all the $\sigma$'s are positive, we consider configurations where
exactly one of the $\sigma$'s is negative. (There are $N$ such
configurations.) Then we examine configurations having two negative
$\sigma$'s and so on. This allows us to have a short code, independent
of the size of $N$.

To achieve our second goal, which is the improvement of the
efficiency of the algorithm, we first sort the $\alpha$ parameters so
that they constitute a non-decreasing list. We call the smallest
$\alpha$ the leftmost and the largest one the rightmost parameter in
the list. The index of the smallest, i.e. the leftmost $\alpha$, is 1,
that of the one to its right is 2, and so on. After considering
the configuration where all $\sigma$'s are positive, we start with the
configuration where only the leftmost $\sigma$ (corresponding to the
smallest $\alpha$) is negative and then move this negative to the right.
Now suppose that when this negative reaches position $i$ on the
list, $B$ becomes negative. Since the $\alpha$'s are sorted, it is clear
that the negative should not move any further to the right. This
saves us the evaluation of $N - i$ configurations that have only one
negative. After the configurations that have only one negative
are exhausted (either because $B$ has become negative or the
negative has reached the rightmost parameter), we take the two
leftmost parameters to be negative. Here, while the leftmost
negative is kept at its spot, the other one is moved across the
list until $B$ becomes negative. At this point, one negative is
temporarily fixed at the second spot while the other one, starting
from the third spot, is moved to the right, and so on.
Figure 1

Figure 2

The general strategy when a configuration with $B < 0$ is
encountered is as follows. The closest negative $\sigma$, say
$\sigma_Q$, to the rightmost negative $\sigma$, say $\sigma_R$, is found
subject to the constraint that there are at least two positive
$\sigma$'s between $R$ and $Q$. ($R$ and $Q$ are the positions of
$\sigma_R$ and $\sigma_Q$ in the sorted list, respectively. See
Figure 1.) If $\sigma_Q$ exists, it is obvious that $\sigma_{Q+1}$ is
positive. Here, the negative at $Q$ is moved to the position $Q + 1$
and all the negatives to the right of the spot $Q + 1$ are placed
immediately next to it without any positive $\sigma$ between them. See
Figure 2. The reason for doing this is that when $B < 0$ is
encountered, all possible configurations (subject to the constraint
that all the negatives at $Q$ and to its left are fixed) that give
rise to $B > 0$ are exhausted. (This can be verified by close
inspection of the string of positive and negative $\sigma$'s in a
sorted list.) Therefore, it is time to move the negative at $Q$
forward and to evaluate the resulting configurations starting with
the one where all the negatives to the right of $Q + 1$ are positioned
immediately next to it. Also, one can verify that when $B > 0$ but
$R = N$, then there must be only one positive $\sigma$ between $R$ and
$Q$. The rest of the steps are the same as in the case where $B < 0$
is encountered. Finally, if $\sigma_Q$ does not exist, then it is time
to increase the number of negatives. Or it may be that the
evaluation of the integral is complete. Below we present a complete
description of the algorithm.

If  a  sorting  program  is  not  available  then  one  should  use  a 
different  version  of  the  algorithm.  This  is  done  in  the  Appendix. 
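The pruning idea behind this section can be illustrated with a short recursive sketch. It is not the iterative R/Q bookkeeping procedure described above; it is a simpler variant, written here for illustration, that exploits the same fact: once the coefficients are sorted in non-decreasing order, if making position j negative drives B to zero or below, the same is true for every position to its right and for every configuration with additional negatives.

```python
from math import factorial, prod


def average_error_pruned(alphas):
    """Average error (25), visiting only sign configurations with B > 0."""
    a = sorted(abs(x) for x in alphas)  # non-decreasing list of magnitudes
    n = len(a)
    if n == 0 or a[0] == 0:
        raise ValueError("need at least one coefficient, all non-zero")
    acc = 0.0

    def visit(start, b, sign):
        # Current configuration has B = b > 0 and product of sigmas = sign.
        nonlocal acc
        acc += sign * b ** (n + 1)
        for j in range(start, n):
            new_b = b - 2.0 * a[j]  # flipping sigma_j from +1 to -1
            if new_b <= 0:
                break  # larger alphas to the right can only make B smaller
            visit(j + 1, new_b, -sign)

    visit(0, float(sum(a)), 1)  # start with all sigmas positive
    return acc / (2 ** n * factorial(n + 1) * prod(a))
```

Each configuration with positive B is reached exactly once, by adding its negatives in order of increasing position, so the result equals the full sum over all sign combinations while skipping most of the terms that would be discarded anyway.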

6.  THE  ERROR  DISTRIBUTION  ALGORITHM 

The expression (36) that we derived for the error
distribution, $P(E > \eta)$, bears close resemblance to that for the
average error. This implies: i) we need an algorithm for the
numerical evaluation of our analytical result; ii) the algorithm can
be derived from the one we proposed for the average error through
minor modifications. The required changes are as follows:

Changes to the Definitions of the Parameters of the Algorithm:

The last item defined in the average error algorithm, namely
E, should be deleted. Instead the following two items have to
be added:

$\eta$: An input to the algorithm. The algorithm will compute the
probability of the error to be greater than $\eta$.

P: The output of the algorithm, namely $P(E > \eta)$.

Changes to the Algorithm Itself:

In step (3) calculate B from $B = -2\eta + \sum_{i=1}^{N}\sigma_i\alpha_i$.

Line 11 in step (5) should read:

Compute $c = 2^{N-1} N! \prod_{i=1}^{N}\alpha_i$. Set P = sum/c. Return.

Line 1 in step (6) should read:

Add the term sign * $B^N$ to sum.

Finally, we have to add one additional statement, namely

If R = 0 set P = 0, and return,

to the top of step (5).

7.  EXAMPLES 

Here  we  discuss  several  cases  where  the  method  can  be  ap¬ 
plied  to  obtain  information  about  the  general  reliability  of  vision 
computations. 

7.1.  Derivatives  of  the  Intensity  Field 

We begin with the computation of the first difference of the
intensity field. Let $I(x, y)$ denote the intensity field at pixel
$(x, y)$. The first difference of $I(x, y)$ in the $x$-direction,
$I_x(x, y)$, is simply


The Average Error Algorithm:

Definitions:

N: The number of $a$ parameters (the coefficients $a_i$).

M: The number of σ's that are negative.

R: The position of the rightmost σ whose value is -1.

Q, N1: When the situation is such that R should not move forward any
further (either because B < 0, or because B > 0 and R = N), the closest
negative to R, subject to the constraint that there are at least N1
positives between them, has to be found. The place of this negative is
called Q. (Note that if B < 0, N1 = 2; otherwise N1 = 1.)

PS: The number of positives between R and Q.

NG: The number of negatives between R and Q, including Q.

flag: Set to 1 when R = N and B > 0. Otherwise, it is set to 0.

sign: $\prod_{i=1}^{N} \sigma_i$. That is, sign = 1 if there is an even number
of negative σ's and sign = -1 otherwise.

E: The output of the algorithm, namely the average error.

The Algorithm:

1) Sort the a's so that they constitute a non-decreasing list. The smallest a in the
   list is referred to as the leftmost element and the largest one as the rightmost element.
   Also set M = -1, sum = 0.0, sign = -1 and flag = 0.

2) Increase the number of negative σ's from M to M + 1, and place them
   at the M + 1 leftmost positions of the list. That is,
      set M = M + 1, and
      for I = 1 to I = M, set σ[I] = -1
      for I = M + 1 to N, set σ[I] = 1.
   Change the sign of sign, i.e. set sign = -sign.
   Also set R = M.

3) Calculate B from $B = \sum_{i=1}^{N} \sigma_i a_i$.

4) If B > 0 and flag = 0 go to 6.

5) Here, either we have B < 0 or flag = 1. This implies that either the
   computation of the integral is complete or the configuration of positives
   and negatives needs extensive rearrangement:
      Set flag = 0.
      If B < 0 set N1 = 2, else set N1 = 1.
      Set NG = 1.
      Start from R, go back to the left, increment NG whenever a new negative
      is encountered.
      IF the beginning of the list is reached before N1 positives are encountered, THEN
         If B < 0 the computation of the integral is practically complete. Compute
         $c = 2^{N} (N+1)! \prod_{i=1}^{N} a_i$. Set E = sum/c. Return.
         Else the number of negatives in the list must be increased. Go to 2.
      Else (i.e. when N1 positives are reached) set PS = N1 and go back
      further to the left. Increment PS whenever a new positive is encountered.
         If the beginning of the list is reached before any negative is encountered,
         then the number of negatives must be increased. Go to 2.
         Else when the first negative is encountered, set Q = R - NG - PS.
      Rearrange the negatives:
         Move the negative at Q to the position Q + 1 and place all the negatives
         to its right at the spots Q + 2 through Q + NG + 1.
         Set R = Q + NG + 1.
         Go to 3.


6) Add the term sign * $B^{N+1}$ to sum.
   If R = 0, go to 2.
   If R < N move R to the right. That is, set σ[R] = 1.
      Also, set R = R + 1 and σ[R] = -1.
      Calculate B by setting B = B + 2(a[R - 1] - a[R]).
      Go to 4.
   Else set flag = 1. Go to 5.


$$f_x(x,y) = f(x,y) - f(x-1,y) \tag{37}$$

(see, for example, Rosenfeld and Kak [5]). The range of the digitization
error in either $f(x,y)$ or $f(x-1,y)$ is from -1/2 to
1/2 units of gray level. The average value for the error in the
computation of the first difference of the intensity field is then

$$E = \int_{-1/2}^{1/2} dx_1 \int_{-1/2}^{1/2} dx_2 \, |x_1 - x_2| = \frac{1}{3} \tag{38}$$

of the gray level unit.

In addition to computing the average error, it is interesting
to know about the distribution of the error. This can be done
by, say, plotting the error distribution using the tools already
developed. Here, however, we content ourselves with a few samples.
The maximum error is $E_{max} = 1$ gray level unit. The
probability of the error being in the outer 75% range (i.e. being
more than 3/4 gray level unit) is P(E > 0.75) = 0.0625, namely
only 6.25%. Likewise, the probability of the error being in the
outer 50% range (i.e. being more than 1/2 gray level unit) is
P(E > 0.5) = 0.25.

The average value for the relative error in $f_x(x,y)$ is
$1/(3\,|f(x,y) - f(x-1,y)|)$. By substituting the numerical value for the
difference between $f(x,y)$ and $f(x-1,y)$ in the above expression
one can obtain the average value of the relative error. The
distribution of the relative error can be computed similarly.

Next we consider the second differences of the intensity field.
For the second difference in the x-direction, $f_{xx}$, we have

$$f_{xx}(x,y) = f(x-1,y) + f(x+1,y) - 2f(x,y) \tag{39}$$

with the average error

$$E = \int_{-1/2}^{1/2} dx_1 \int_{-1/2}^{1/2} dx_2 \int_{-1/2}^{1/2} dx_3 \, |2x_1 - x_2 - x_3| = 0.5833 \tag{40}$$

The maximum error is $E_{max} = 2(1/2) + 1/2 + 1/2 = 2$ gray
level units. The probability of the error being in the outer 75%
range (i.e. being more than 1.5 gray level units) is P(E > 1.5) =
0.0208, namely only 2%. Likewise, the probability of the error
being in the outer 50% range (i.e. being more than 1 gray level
unit) is P(E > 1) = 0.1667.

For the second difference in the x and y directions, $f_{xy}$, we have

$$f_{xy}(x,y) = f(x,y) + f(x-1,y-1) - f(x-1,y) - f(x,y-1) \tag{41}$$

with the average error

$$E = \int_{-1/2}^{1/2} dx_1 \int_{-1/2}^{1/2} dx_2 \int_{-1/2}^{1/2} dx_3 \int_{-1/2}^{1/2} dx_4 \, |x_1 + x_2 - x_3 - x_4| = 0.4667 \tag{42}$$

The maximum error is $E_{max} = 1/2 + 1/2 + 1/2 + 1/2 = 2$ gray
level units. The probability of the error being in the outer 75%
range (i.e. being more than 1.5 gray level units) is P(E > 1.5) =
0.0052, namely only 0.5%. Likewise, the probability of the error
being in the outer 50% range (i.e. being more than 1 gray level
unit) is P(E > 1) = 0.0833, i.e. 8.3%.
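
The averages and probabilities quoted for the first and second differences can also be checked by simulation. The following is a rough Monte Carlo sketch, not part of the method of this paper; the function name, sample count and seed are arbitrary choices made for illustration.

```python
import random


def simulate(coeffs, thresholds, samples=100_000, seed=0):
    """Estimate the mean of E = |sum(c_i * x_i)| and P(E > q) for each q,
    drawing every digitization error x_i uniformly from [-1/2, 1/2]."""
    rng = random.Random(seed)
    total = 0.0
    exceed = [0] * len(thresholds)
    for _ in range(samples):
        e = abs(sum(c * (rng.random() - 0.5) for c in coeffs))
        total += e
        for k, q in enumerate(thresholds):
            if e > q:
                exceed[k] += 1
    return total / samples, [n / samples for n in exceed]


# Coefficients follow the difference formulas for f_x, f_xx and f_xy.
mean_fx, _ = simulate([1, -1], [])
mean_fxx, tails_fxx = simulate([1, 1, -2], [1.0, 1.5])
mean_fxy, tails_fxy = simulate([1, 1, -1, -1], [1.0, 1.5])
```

The estimates fluctuate with the seed and sample count, so they should agree with the analytical values only to about two decimal places.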

Note that the computation of $f_{xx}$ (and similarly $f_{yy}$) is significantly
less accurate than that of $f_{xy}$. This is in contrast to
the fact that the maximum error for both differences is the same,
namely, two gray level units. The numerical values of the relative
error for the second difference $f_{xx}$, $R_{xx}$, can be obtained by
substituting for the values of the f's in

$$R_{xx} = 0.5833 / |2f(x,y) - f(x-1,y) - f(x+1,y)|. \tag{43}$$

For the average value of the relative error not to be greater
than, e.g. 10%, the second difference $f_{xx}$ must be about 6 units or more. Since
frequently this is not the case, one may conclude that algorithms
that involve the second differences of the intensity field are not
reliable. The numerical value of the relative error for the second
difference $f_{xy}$ can be obtained from
$0.4667 / |f(x,y) + f(x-1,y-1) - f(x-1,y) - f(x,y-1)|$.
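
As a worked check of the 10% threshold, using only the average errors 0.5833 and 0.4667 quoted above, the smallest magnitudes of the second differences that keep the average relative error at or below 0.10 are:

```latex
|f_{xx}| \;\ge\; \frac{0.5833}{0.10} \approx 5.8 ,
\qquad
|f_{xy}| \;\ge\; \frac{0.4667}{0.10} \approx 4.7
\quad \text{(gray level units)}.
```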

7.2.  Computation  of  Rotation  Angles  in  Stereo 

The  previous  example  was  concerned  with  the  quantization 
of  the  intensity  field.  Here,  we  consider  examples  that  involve 
spatial  sampling.  An  interesting  example  is  the  following. 
Kamgar-Parsi  and  Eastman  [6]  have  shown  that  examples  of 
sets  of  image  points  yielding  a  most  stable  computation  of  the 
relative  roll  angle  in  a  stereo  system  with  small  relative  angles 
are  the  combination  of  points  shown  in  Figure  3  or  that  shown 
in  Figure  4. 

It is shown that for points (1,2,3) of Figure 3, the error in
the roll angle, $\Delta_{roll}$, is given by

$$\Delta_{roll} \approx \frac{\Delta\delta_{1y} - \Delta\delta_{3y}}{2a} \tag{44}$$

where a is half the width or length of the image planes.

The numerator is the error in the difference of the vertical
disparities of points 1 and 3. (Note that the quantization error
in point 2 does not contribute.) The maximum value for this
expression is 2 pixels, one pixel of error being contained in each vertical
disparity. Thus, the maximum error for the relative roll angle
will be 1/a radians (note that a is now in pixels). For a typical
camera we may assume that the focal length is f = 7 millimeters,
the view angle is about 50 degrees and the image plane is 512
by 512 pixels. This implies that each pixel covers a square of
0.012 by 0.012 millimeters and half the width or length of the
image plane, a, is about 3.1 millimeters. For such a camera we
have $\Delta_{roll} = 0.22$ degrees.




Since $\Delta\delta_{1y} - \Delta\delta_{3y} = \Delta y_{1l} - \Delta y_{1r} - \Delta y_{3l} + \Delta y_{3r}$, the average
error in the roll angle is $1/(2a)$ times the quadruple integral of
$|x_1 - x_2 - x_3 + x_4|$ over $-1/2 \le x_i \le 1/2$.
This quadruple integral is evaluated in (42). It is equal to 7/15.
Therefore, on the average, for points (1,2,3) shown in Figure 3
the error in the roll angle is $7/(30a)$ radians or 0.05 degrees. Note
that this error for most practical purposes is not too large.

For the configuration of image points shown in Figure 4 we
have

$$\Delta_{roll} \approx \frac{(f^2 + a^2)\,(\Delta\delta_{1y}/2 + \Delta\delta_{2y}/2 - \Delta\delta_{3y})}{2a\,(f^2 + a^2/2)}$$

The maximum error in the previous case is 1/a and in this case
$(f^2 + a^2)/[a(f^2 + a^2/2)] > 1/a$ radians. Thus, the maximum error
in the present case (which for our camera model turns out to be
0.24 degrees) is somewhat larger than that of the previous case.
For the average error we have

$$E_{roll} = \frac{f^2 + a^2}{2a\,(f^2 + a^2/2)} \int_{-1/2}^{1/2} dx_1 \cdots \int_{-1/2}^{1/2} dx_6 \, |0.5(x_1 + x_2 + x_3 + x_4) - x_5 - x_6|$$


The above multiple integral is equal to 0.4043. Therefore,
it can be shown that the average error in this case is 0.4404/2a
radians (for our camera model the factor $(f^2 + a^2)/(f^2 + a^2/2)$ is
about 1.089). Note that this is less than the average error in the
previous case, i.e. 0.4667/2a. The interesting point is that of
the two cases that we considered one, namely the latter, has a
greater maximum error but a smaller average error. The reason
for this is that in the latter case three quantities, namely $\Delta\delta_{1y}$,
$\Delta\delta_{2y}$ and $\Delta\delta_{3y}$, need to "enhance" each other in order to yield
the maximum error. The previous case involved only $\Delta\delta_{1y}$ and
$\Delta\delta_{3y}$ and, therefore, the probability of having maximum error
was larger.

Figure 3


We now look at the probability of the error for the two cases
being in the outer 75% and 50% ranges. For this we concentrate
on the error distributions of $\Delta\delta_{1y} - \Delta\delta_{3y}$ and
$\Delta\delta_{1y}/2 + \Delta\delta_{2y}/2 - \Delta\delta_{3y}$. The two expressions have the same maximum
error, i.e. two pixels, but different error distributions. For
the former expression we obtain P(E > 1.5 pixels) = 0.0052
and P(E > 1 pixel) = 0.0833, whereas for the latter we get
P(E > 1.5 pixels) = 0.0007 and P(E > 1 pixel) = 0.0417. Note,
however, that for the case of Figure 3, P(E > 1.5 pixels) and
P(E > 1 pixel) correspond to P(E > 0.165 degrees) and P(E >
0.110 degrees), respectively, whereas for Figure 4 they correspond
to P(E > 0.18 degrees) and P(E > 0.12 degrees). Therefore,
to make a meaningful comparison between the two Figures,
we calculate P(E > 0.165 degrees) and P(E > 0.110 degrees)
for Figure 4 too. They turn out to be 0.0026 and 0.0644 respectively.
These errors do not seem to be very significant. However,
if we were dealing with points closer to the middle of the image
plane the situation would be different. Suppose, for example,
that we had the same configuration of points except that the
distances of the points from the center of the image were all half
what they are in Figures 3 and 4. In such a situation for the two
cases we would have the same error expressions, except that the
value of a would be divided by two. Thus, for a case similar
to Figure 3, we would have, e.g. P(E > 0.20 degrees) = 0.12,
namely 12%. We note that for the 3D recovery of object points
that are not close to the stereo system a 0.2 degree error in the
relative roll angle is quite significant. This is especially true
because the error in the relative roll angle affects the error in
the relative pan angle, and the error in the pan angle by itself
quadruples when a decreases by half (see [6] and [7]).
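
The angular values quoted in this section follow from the pixel-domain results by simple arithmetic. The sketch below reproduces them under the camera model stated above (f = 7 mm, a of about 3.1 mm, i.e. 256 pixels); the variable names are illustrative, and the figures 7/15 and 0.4043 are taken from the text rather than recomputed.

```python
import math

F_MM = 7.0       # focal length, millimeters
A_MM = 3.1       # half-width of the image plane, millimeters
A_PIXELS = 256   # half-width of a 512 x 512 image, pixels

# Geometric factor of the Figure 4 expression (f and a in the same units).
ratio = (F_MM**2 + A_MM**2) / (F_MM**2 + A_MM**2 / 2)

# Degrees of roll error per pixel of the disparity expression.
deg_per_pixel_fig3 = math.degrees(1 / (2 * A_PIXELS))
deg_per_pixel_fig4 = ratio * deg_per_pixel_fig3

max_fig3 = 2 * deg_per_pixel_fig3            # maximum error, 2 pixels
max_fig4 = 2 * deg_per_pixel_fig4
avg_fig3 = (7 / 15) * deg_per_pixel_fig3     # average error, Figure 3
avg_fig4 = 0.4043 * deg_per_pixel_fig4       # average error, Figure 4
scaled_integral = 0.4043 * ratio             # numerator of 0.4404/2a
```

With these definitions, a threshold of 1.5 pixels corresponds to `1.5 * deg_per_pixel_fig3` degrees for Figure 3 and `1.5 * deg_per_pixel_fig4` degrees for Figure 4, which is why the two configurations have to be compared at a common angle rather than a common pixel threshold.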

7.3.  Stereo  Triangulation 

As the final example we consider the problem of "Quantization
Error in Stereo Triangulation" discussed by Blostein and
Huang [8]. The stereo system that they have considered consists
of two identical pinhole cameras. As regards the geometry of the
stereo setup, they have assumed the two cameras are at equal
heights and their image planes are on a common geometrical
plane.

In addition to the above assumptions, they have implicitly
assumed that the positions of the image centers are known precisely.
Therefore, as far as the recovery of the 3-D position of a
point in space is concerned the stereo triangulation reduces to a
simple triangle, where the recovered projecting rays (joining the
right and the left image positions of the point in space and the
focal points) do intersect each other.

Figure 4

In what follows we employ the notations used in [8]. The
origin of the 3-D world coordinate system is assumed to be midway
between the two focal points. Let the X-axis point vertically
upward, the Y-axis point towards the right camera and the Z-axis
point straight ahead. Further, we assume that the pixels
are squares of size dv. Let the coordinates of a point S on the
object be (x, y, z) and the coordinates of its right and left images
be $(I_r, J_r)$ and $(I_l, J_l)$ in pixel units, respectively. Ignoring
the quantization error and using the stereo triangulation, the
computed coordinates of point S will be


$$(x, y, z) = \frac{B}{J_r - J_l}\left[ I_l,\ \frac{J_r + J_l}{2},\ \frac{f}{dv} \right] \tag{48}$$


where B is the baseline, and $I_l = I_r$. The quantities $I_l$, $I_r$, $J_l$
and $J_r$ can be erroneous by up to half a pixel. Thus, denoting
the exact (unquantized) coordinates of the image points by
$(I_r + x_3, J_r + x_1)$ and $(I_l + x_3, J_l + x_2)$, we have $-1/2 < x_i < 1/2$,
where i can be 1, 2 or 3. (Note that for this geometrically ideal
stereo model the vertical quantization error is the same in both
images.)

Like Blostein and Huang we compute the probabilities of the
quantities $e_z = \Delta z/z$, $e_y = \Delta y/z$ and $e_x = \Delta x/z$ being within a
given range, where, e.g. $\Delta z$ is the error in z due to quantization
and $e_z$ is the scaled error.


7.3.1.  Error  in  Depth 

From equation (48) we obtain $e_z = (x_1 - x_2)/D$, where
$D = J_r - J_l$ is the observed disparity in pixels. An interesting
analysis in [8] shows that when D is not too small, then we
may assume that $x_1$ and $x_2$ are quantized independently, each
of them following a uniform probability distribution. Hence,
to compute the probability of $|e_z|$ being smaller than, say $\tau$,
namely $P(|e_z| < \tau)$ or $P(|x_1 - x_2| < D\tau)$, we use equation (36).
By setting N = 2 and $a_1 = a_2 = 1$ we obtain

$$P(|e_z| < \tau) = 1 - P(|e_z| > \tau) = 1 - \tfrac{1}{4}\left[ (-2D\tau + 2)^2\, \theta(-2D\tau + 2) + \text{terms having negative arguments to their } \theta\text{-functions} \right] \tag{49}$$

$$P(|e_z| < \tau) = \begin{cases} 1 & \text{if } \tau \ge 1/D \\ 2D\tau - D^2\tau^2 & \text{if } 0 < \tau < 1/D \end{cases}$$

The  above  result  agrees  with  [8]. 
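
For reference, the step from (49) to the closed form for $0 < \tau < 1/D$ can be written out as follows. This is a sketch of the algebra only, using N = 2 and $a_1 = a_2 = 1$ as in the text:

```latex
\begin{aligned}
P(|e_z| > \tau) &= \tfrac{1}{4}\,(2 - 2D\tau)^2 = (1 - D\tau)^2 , \\
P(|e_z| < \tau) &= 1 - (1 - D\tau)^2 = 2D\tau - D^2\tau^2 .
\end{aligned}
```

For example, at $\tau = 1/(2D)$ the probability is $1 - 1/4 = 3/4$, and at $\tau = 1/D$ it reaches 1.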

7.3.2. Error in the Horizontal Direction

Now we compute $P(|e_y| < \tau)$. From equation (48) we obtain

$$R\, e_y = \left( \frac{h}{D} - 1 \right) x_1 - \frac{h}{D}\, x_2$$

where $R = f/dv$ and $h = J_r$. As the above equation is written
we have $a_1 = |h/D - 1|$ and $a_2 = |h/D|$. Depending on the
values of h and D we need to distinguish several cases. In order
to avoid possible confusion we introduce a new variable symmetrizing
$a_1$ and $a_2$. Let $\gamma = h/D - 1/2$. Now it is apparent
that we need to distinguish only two cases, namely $|\gamma| < 1/2$
and $|\gamma| > 1/2$.

Case I): $|\gamma| < 1/2$

In this case $a_1 = 1/2 - |\gamma|$ and $a_2 = 1/2 + |\gamma|$. Note that
we have arranged the two a's so that the smaller one is $a_1$.
Using (36) we see that there are at most two terms that can
have positive arguments in their θ-functions. The results follow
immediately:

$$P(|e_y| < \tau) = \begin{cases} 1 & \text{if } \rho \ge 1/2 \\[4pt] \dfrac{\rho - \rho^2 - \gamma^2}{1/4 - \gamma^2} & \text{if } |\gamma| < \rho < 1/2 \\[8pt] \dfrac{2\rho}{1/2 + |\gamma|} & \text{if } 0 < \rho < |\gamma| \end{cases}$$

where $\rho = R\tau$.

Case II): $|\gamma| > 1/2$

In this case $a_1 = |\gamma| - 1/2$ and $a_2 = |\gamma| + 1/2$.

$$P(|e_y| < \tau) = \begin{cases} 1 & \text{if } \rho \ge |\gamma| \\[4pt] \dfrac{2|\gamma|\rho - \rho^2 - 1/4}{\gamma^2 - 1/4} & \text{if } 1/2 < \rho < |\gamma| \\[8pt] \dfrac{2\rho}{1/2 + |\gamma|} & \text{if } 0 < \rho < 1/2 \end{cases} \tag{53}$$

We note that in this case our results are in disagreement with
those in [8].
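
The two cases can be collected into one small routine. This is a minimal sketch that assumes the piecewise expressions follow from applying equation (36) with N = 2 and the coefficients $a_1$, $a_2$ given above; the function and argument names are illustrative.

```python
def prob_ey_within(gamma, rho):
    """P(|e_y| < tau), with gamma = h/D - 1/2 and rho = R * tau > 0."""
    g = abs(gamma)
    if g < 0.5:
        # Case I: coefficients 1/2 - |gamma| and 1/2 + |gamma|
        if rho >= 0.5:
            return 1.0
        if rho >= g:
            return (rho - rho**2 - g**2) / (0.25 - g**2)
    else:
        # Case II: coefficients |gamma| - 1/2 and |gamma| + 1/2
        if rho >= g:
            return 1.0
        if rho >= 0.5:
            return (2 * g * rho - rho**2 - 0.25) / (g**2 - 0.25)
    # Smallest range of rho: the same expression holds in both cases
    return 2.0 * rho / (0.5 + g)
```

At the boundaries between ranges the neighbouring expressions agree, so the choice of `>=` versus `>` does not affect the returned value.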

7.3.3.  Error  in  the  Vertical  Direction 

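Finally, we compute $P(|e_x| < \tau)$. From equation (48) we obtain

$$e_x = \frac{1}{R}\left[ \frac{v}{D}(x_1 - x_2) - x_3 \right]$$

where $v = I_l$. Thus,

$$P(|e_x| < \tau) = 1 - P(|e_x| > \tau) = 1 - P\left( \left| \frac{v}{D}(x_1 - x_2) - x_3 \right| > \rho \right)$$

where as before $\rho = R\tau$. From the above expression we see that
$a_1 = a_2 = \nu$, where $\nu = |v|/D$ (note that D > 0 always), and
$a_3 = 1$. By using equation (36) we have

$$\begin{aligned} P(|\nu x_1 + \nu x_2 + x_3| > \rho) = \frac{1}{24\nu^2} \Big[ & (-2\rho + 2\nu + 1)^3\, \theta(-2\rho + 2\nu + 1) \\ & - 2(-2\rho + 1)^3\, \theta(-2\rho + 1) \\ & - (-2\rho + 2\nu - 1)^3\, \theta(-2\rho + 2\nu - 1) \\ & + (-2\rho - 2\nu + 1)^3\, \theta(-2\rho - 2\nu + 1) \\ & + \text{terms having negative arguments to their } \theta\text{-functions} \Big] \end{aligned} \tag{56}$$

Hence, depending on the magnitude of $\nu$ we must distinguish
three cases, $\nu > 1$, $1 > \nu > 1/2$ and $\nu < 1/2$. Denoting

$$C_1 = \frac{(-2\rho + 2\nu + 1)^3}{24\nu^2}, \quad C_2 = \frac{(-2\rho + 1)^3}{24\nu^2}, \quad C_3 = \frac{(-2\rho + 2\nu - 1)^3}{24\nu^2}, \quad C_4 = \frac{(-2\rho - 2\nu + 1)^3}{24\nu^2}$$

the results follow immediately:

CASE I): $\nu > 1$ or $|v| > D$

$$P(|e_x| < \tau) = \begin{cases} 1 & \text{if } 2\rho \ge 2\nu + 1 \\ 1 - C_1 & \text{if } 2\nu - 1 < 2\rho < 2\nu + 1 \\ 1 - C_1 + C_3 & \text{if } 1 < 2\rho < 2\nu - 1 \\ 1 - C_1 + 2C_2 + C_3 & \text{if } 0 < 2\rho < 1 \end{cases} \tag{58}$$

CASE II): $1 > \nu > 1/2$ or $D > |v| > D/2$

$$P(|e_x| < \tau) = \begin{cases} 1 & \text{if } 2\rho \ge 2\nu + 1 \\ 1 - C_1 & \text{if } 1 < 2\rho < 2\nu + 1 \\ 1 - C_1 + 2C_2 & \text{if } 2\nu - 1 < 2\rho < 1 \\ 1 - C_1 + 2C_2 + C_3 & \text{if } 0 < 2\rho < 2\nu - 1 \end{cases}$$

CASE III): $\nu < 1/2$ or $D/2 > |v|$

$$P(|e_x| < \tau) = \begin{cases} 1 & \text{if } 2\rho \ge 2\nu + 1 \\ 1 - C_1 & \text{if } 1 < 2\rho < 2\nu + 1 \\ 1 - C_1 + 2C_2 & \text{if } -2\nu + 1 < 2\rho < 1 \\ 1 - C_1 + 2C_2 - C_4 & \text{if } 0 < 2\rho < -2\nu + 1 \end{cases} \tag{60}$$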
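
All three cases can be evaluated with a single routine by keeping the step functions of equation (56) explicit instead of enumerating the ranges of $\rho$. The sketch below assumes $\nu > 0$ and $\rho > 0$; the function and helper names are illustrative.

```python
def prob_ex_within(nu, rho):
    """P(|e_x| < tau), with nu = |v|/D > 0 and rho = R * tau > 0."""

    def cube_term(b):
        # b**3 * theta(b) / (24 nu**2): zero unless the argument is positive
        return b**3 / (24.0 * nu**2) if b > 0 else 0.0

    c1 = cube_term(-2 * rho + 2 * nu + 1)
    c2 = cube_term(-2 * rho + 1)
    c3 = cube_term(-2 * rho + 2 * nu - 1)
    c4 = cube_term(-2 * rho - 2 * nu + 1)
    return 1.0 - (c1 - 2 * c2 - c3 + c4)
```

Because each term switches itself off when its argument is not positive, the same expression reduces to the appropriate row of Case I, II or III for any admissible pair of $\nu$ and $\rho$.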


Except for case I, the above results are in disagreement with
those of [8]. Blostein and Huang report that in order to compute
$P(|e_x| < \tau)$ for different cases they had to evaluate 26 (constrained)
triple integrals by means of the mathematical software
package MACSYMA. This may have led to algebraic mistakes.
Recalling that in this example the number of quantized variables
was only three, one may conclude that lacking a general
formulation for the error distribution makes the problem highly
unmanageable and time consuming for cases where the number
of quantized variables is large.

8.  CONCLUDING  REMARKS 

In computer vision only those approaches are of practical
value that are capable of withstanding digitization errors, for no
matter how accurately an experiment is done, digitization error
will be present. Unfortunately, the computational approaches
that are proposed in computer vision are not subject to any
error analysis, or if they are, the sensitivity analysis is mostly
superficial and of limited value. This may not be an unacceptable
approach in other fields of science, but in computer vision
where digitization error plays such an important role we need to
have a careful error analysis of our computational approaches.

In this paper we have presented a method for genuine sensitivity
analysis of algorithmic approaches. The method we have
developed does not take into account errors due to experiments,
but then such errors are not due to the intrinsic limitations of the
approach in question. We have developed mathematical tools for
the computation of the average error due to digitization and the
probability of the error being within a certain range. The average
error analysis is far more realistic and superior to the worst
case analysis, because even if the maximum error is too great
an algorithm may give acceptable results in most cases. The
examination of the error distribution provides us with a powerful
measure for assessing the applicability of an approach. That
is, given the acceptable amount of error in the solution one can
determine the probability of obtaining good results. Conversely,
given the required confidence in the result one can calculate the
amount of acceptable error.

The  algorithms  that  we  have  presented  for  the  computation 
of  the  average  error  and  the  error  distribution  of  a  given  function 
of  quantized  variables  are  efficient  and  exact.  They  are  also 
convenient to use.

Appendix

In this appendix we present a version of the Average Error
Algorithm that does not require the coefficients ($a_i$'s) to be
sorted. This algorithm is not as efficient as the one given in
the main text, but it is more convenient to use when a sorting
program is not available. The definitions of the parameters are
the same as those given in the algorithm presented in the main
text.

1) Set M = -1, sum = 0.0, sign = -1, and flag = 0.

2) Increase the number of negative σ's from M to M + 1, and place them
   at the M + 1 leftmost positions of the list. That is,
      set M = M + 1, and
      for I = 1 to I = M, set σ[I] = -1
      for I = M + 1 to N, set σ[I] = 1.
   Change the sign of sign, i.e. set sign = -sign.
   Also set R = M.

3) Calculate B from $B = \sum_{i=1}^{N} \sigma_i a_i$.
   If B > 0 add the term sign * $B^{N+1}$ to sum.
   If R = 0, go to 2.
   If R < N move R to the right. That is, set σ[R] = 1.
      Also, set R = R + 1 and σ[R] = -1. Go to 3.
   Else go to 4.

4) Here, R = N. This implies that either the computation of the integral is complete
   or the configuration of positives and negatives needs extensive rearrangement:
      Set NG = 1 and PS = 1.
      Start from R, go back to the left, increment NG whenever a new negative
      is encountered.
      IF the beginning of the list is reached before a positive is encountered, THEN
         If there is no positive anywhere, the computation of the integral is practically complete.
         Compute $c = 2^{N} (N+1)! \prod_{i=1}^{N} a_i$. Set E = sum/c. Return.
         Else the number of negatives in the list must be increased. Go to 2.
      Else (i.e. when a positive is reached) go back further to the left. Increment PS
      whenever a new positive is encountered.
         If the beginning of the list is reached before any negative is encountered,
         then the number of negatives must be increased. Go to 2.
         Else when the first negative is encountered, set Q = R - NG - PS.
      Rearrange the negatives:
         Move the negative at Q to position Q + 1 and place all the negatives
         to its right at positions Q + 2 through Q + NG + 1.
         Set R = Q + NG + 1.

         Go to 3.

Abstract


We proposed a view consistency approach in the 1987
DARPA Image Understanding workshop as a way to extract
three dimensional information from multiple images
of a polyhedral scene, without any knowledge of a
model or information about the scene. We report further
experiments in the use of this approach. We also explore
the use of this approach for matching parameterized
models in the absence of complete knowledge about the
orientation of the camera. Towards this end, we have developed
an algebraic reasoning system which combines a
Gröbner basis algorithm for reasoning about polynomial
equational constraints and Bledsoe-Shostak-Brooks' extended
SUP-INF method for reasoning about inequality
constraints. Inequalities can be used to express the
range of parameter values as well as imprecise data. It is
shown how the Gröbner basis algorithm extends the domain
of application of the SUP-INF method as well as increases
the efficiency of the SUP-INF method, especially
in the presence of equational constraints. A number of
improvements over Brooks' extension of the SUP-INF
method are discussed.


1  Introduction 


The use of explicit geometric models is believed to be an effective
approach to recognize and locate objects in natural scenes
[Besl and Jain; Brooks]. Geometric constraints imposed by a
three-dimensional model can be used to eliminate false matches
to background and clutter features. In addition, the process
of matching the model to image data can produce the coordinate
transformation between the model and the image reference
frame. In many applications, numerically exact geometric models
are often not available. Typically one may have some topological
information about the model with partial knowledge of
the geometry. Description of such models can be parameterized
with parameters taking values within a certain range. There
can also be situations in which no information about a model
is available.


Many different methods have been proposed in the literature
to generate three dimensional models; these include building
models from CAD tools [Brooks], line drawings [Start],
range data [Connolly and Stenstrom], and stereo data.


The goal of this research effort is to perform object matching
in images when numerically exact information about the
model is not available. The object matching process can rely
only on parameterized models or, in the worst case, use one
image as a model to match against another image. In the last
Image Understanding Workshop in 1987, we discussed an approach
towards addressing this problem. This approach involves
the use of geometric and algebraic reasoning methods
to generate a set of constraints on the geometric and topological
structure of an object from its image. These constraints
are used as a model for matching against another image. We
have developed and experimented with techniques for reasoning
about geometric relationships. A geometric and algebraic
reasoning system, GEOMETER, has been developed.

In [Cyrluk et al], we reported on preliminary experiments
about the use of this approach for simple ideal images. Since
then, the approach has been further developed, and improved
by precomputing certain algebraic relationships as well as by
making improvements in the algebraic reasoning algorithms
used in GEOMETER. We have been able to improve the performance
by an order of magnitude and in some cases, by two
orders of magnitude. The improved method has been successfully
used to extract three dimensional information from a more
complex polyhedral object.

In our attempts to deal with imprecisely specified coordinate
positions of image points as well as errors in data, GEOMETER
has been extended to manipulate inequality constraints.
This extension also provides the ability to specify
parameterized models which can be used to match against images.
In a parameterized model, certain geometric information
of a model is incompletely specified using parameters taking
values within a certain range.

In the next section, we review the view consistency approach.
We discuss the algebraic formulation of this problem.
We give an overview of the Gröbner basis approach for determining
the consistency of nonlinear equational algebraic constraints.
For details of the method, the reader may refer to our
paper in the 1987 DARPA Image Understanding proceedings
[Cyrluk et al]. We report on additional experiments for determining
3-dimensional structure from multiple two-dimensional
images of polyhedral blocks.

The work reported here was partially supported by the DARPA Strategic Computing Program
under the Army Engineer Topographic Laboratories, Contract No. DACA76-86-C-0007.

Current address: Department of Computer Science, State University of New York at Albany,
Albany, NY 12222.

In Section 3, we briefly discuss the SUP-INF method for reasoning
about linear inequalities as proposed by Bledsoe [1975]
and improved by Shostak [1977]. Then, we give an overview of Brooks'
extension [Brooks] to the method for handling a subclass of
nonlinear inequality constraints. This is followed by a discussion
of our implementation of Brooks' method in GEOMETER
and its integration with the Gröbner basis method. A number
of observations and improvements are discussed which extend
the applicability of the SUP-INF method as well as improve
its performance by processing equality constraints using the
Gröbner basis method. An extension to the combined method
is discussed in which variables are split based on knowledge of
their lower and upper bounds; this extension produces better
intervals of values that variables can have when constraints are
satisfiable.

Subsequently, a number of experiments on a very simple
example from an image understanding application are discussed.
In this example, image coordinates have errors, and a model is
parameterized.


2  View  consistency  approach 

The view consistency approach [Cyrluk et al] is based on an
algebraic method to extract a partial model from a single
two-dimensional perspective view of an object and the use of this
model to recognize the object from another viewpoint. The
view consistency problem is defined as follows: given a two-dimensional
image of a scene, determine whether another two-dimensional
image could possibly be of the same scene or not. It is not possible
to derive a complete, three-dimensional model from a single
view without assuming a great deal about the relationships
between object surfaces and edges. For example, a common assumption
is that the object is a polyhedron with adjacent faces
and edges perpendicular [Herman]. With this assumption, it
is possible to establish the three-dimensional structure of the
visible surfaces from a single view. Using multiple images of
a scene, it is possible to derive three dimensional information
such as depth as well as information about the orientation and
position of the camera if that is not known.

In our investigation, we consider polyhedral objects and assume
that the perspective image transformation can be approximated
by an affine transformation. An affine transformation
is an orthographic projection with subsequent scaling of coordinate
axes. An affine approximation is more realistic than the
usual orthographic assumption, but more constraining than
the full perspective case. We believe, however, that the perspective
transformation can also be handled in our approach.

2.1  Algebraic  formulation  of  the  problem 

The  least  restrictive  constraint  that  can  be  derived  from  the 
projection  is  simply  that  the  image  vertices  are  related  to  the 
corresponding  object  vertices  by  projection  equations  where 
the  three-dimensional  coordinates  of  each  vertex  are  unknown, 
along  with  the  six  affine  projection  parameters. 

Once  a  set  of  projection  constraints  is  established  from  a 
given  view  of  the  object,  they  can  be  used  as  a  model  for  recog¬ 
nition.  The  projection  constraints  are  expressed  as  a  set  of  al¬ 
gebraic  equations  in  terms  of  unknown  three-dimensional  coor¬ 
dinates  including  depths  of  object  vertices  and  transformation 
parameters. These depths cannot be determined from a single
two-dimensional view. If we consider another two-dimensional
projection,  which  is  hypothesised  to  be  another  view  of  the  ob¬ 
ject,  then  the  equations  derived  from  each  projection  should  be 
algebraically  consistent;  of  course,  the  correct  assignment  has 
to  be  made  between  the  corresponding  vertices  of  each  view. 
Instead  of  assigning  a  single  vertex  in  one  image  to  a  vertex  in 
another  image,  a  group  of  vertices  in  one  image  are  assigned 
to  a  corresponding  group  of  vertices  in  the  other  image.  One 
example  of  such  a  group  is  to  identify  a  cycle  of  edges  and  ver¬ 
tices  in  the  projection  image  that  corresponds  to  a  visible  face 
of  the  three-dimensional  polyhedron;  labeling  can  be  helpful  in 
determining  such  cycles  [Cyrluk  et  al;  Barry  et  al].  The  fact 
that  the  cycle  elements  must  be  coplanar  can  then  be  used  to 
further  constrain  the  three-dimensional  structure. 

The  process  of  recognition  is  to  first  form  assignments  be¬ 
tween  vertex  groups  in  each  projection  and  then  to  test  the 
algebraic  consistency  of  the  resulting  equations  using  the  copla¬ 
narity  constraints  on  vertex  groups.  If  the  equations  are  con¬ 
sistent,  then  the  two  views  may  correspond  to  the  same  three- 
dimensional  object;  or  at  least  they  share  some  solutions  for 
the  possible  objects  that  are  consistent  with  both  projections. 
The  consistency  also  allows  a  more  specific  determination  of  the 
three-dimensional  configuration  of  the  object.  For  example,  if 
the  transformation  between  the  views  is  given  in  advance,  then 
the  correspondence  between  equations  is  similar  to  the  feature 
matching  done  in  classical  stereo  analysis.  The  matching  be¬ 
tween  vertices  provides  the  relative  depth  value  for  each  vertex 
match. 

If  the  transformation  between  views  is  unknown,  then  the 
depths  cannot  be  determined,  but  only  constrained.  The  object 
surface  depth  is  not  explicitly  determined  but  the  solutions  of 
the  equations  represent  a  space  of  possible  object  surfaces.  The 
introduction  of  new  constraints,  either  from  hypotheses  about 
geometric  constraints  on  groups  of  projection  elements,  or  from 
new  views  of  the  object,  will  reduce  the  number  of  unknown 
coordinate  values.  In  this  sense,  the  approach  is  incremental. 

If  an  inconsistency  is  detected,  however,  then  other  assign¬ 
ments  of  vertex  groups  have  to  be  tried.  Various  heuristics 
such  as  the  number  of  vertices  in  a  vertex  group  and  adjacency 
information  about  vertices  in  a  vertex  group  can  be  used  to 
reduce  the  search  space  of  possible  assignments  between  the 
vertex  groups  of  the  two  images. 

In  summary,  the  following  are  the  key  steps  in  our  method: 

1.  Segment  images  and  identify  faces  (using  labeling). 

2.  Match a face in the first image to a face in the second
image, i.e., assign vertices on the face in the first image
to vertices on the face in the second image.

3.  Check for consistency:

(a)  If the match is far off, an inconsistency will usually
be detected in the first match itself. If all vertex
assignments between the two faces are inconsistent,
then backtrack and try to match the first face
against some other face in the second image. If no
such face is left in the second image, then the two
images are not of the same object.

(b)  Otherwise, use the three-dimensional information already
derived in the form of algebraic relations to
match a second face in the first image against another
face in the second image.

(c)  Keep matching faces in one image against the corresponding
faces in the other image until the desired
three-dimensional information is derived.
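
The steps above can be sketched as a depth-first search. This is only an illustrative sketch: the function names, the representation of a face as a tuple of vertex labels in cyclic order, and the `consistent` callback (standing in for the algebraic consistency test) are assumptions, not part of the implementation described in this paper. Only rotations of the cyclic order are tried when assigning vertices, and the toy oracle at the end is purely for demonstration.

```python
from typing import Callable, Dict, FrozenSet, Optional, Sequence, Tuple

Face = Tuple[str, ...]            # vertex labels in cyclic order (assumed representation)
Assignment = Dict[str, str]       # vertex in image 1 -> vertex in image 2


def vertex_assignments(face_a: Face, face_b: Face):
    """Yield the assignments obtained by rotating face_b against face_a."""
    if len(face_a) != len(face_b):
        return
    for shift in range(len(face_b)):
        rotated = face_b[shift:] + face_b[:shift]
        yield dict(zip(face_a, rotated))


def match_faces(faces_a: Sequence[Face],
                faces_b: Sequence[Face],
                consistent: Callable[[Assignment], bool],
                partial: Optional[Assignment] = None,
                used: FrozenSet[int] = frozenset()) -> Optional[Assignment]:
    """Return a consistent assignment covering every face in faces_a, or None."""
    partial = partial or {}
    if not faces_a:
        return partial
    first, rest = faces_a[0], faces_a[1:]
    for j, candidate in enumerate(faces_b):
        if j in used:
            continue
        for assign in vertex_assignments(first, candidate):
            # Vertices shared with earlier faces must keep their assignment.
            if any(partial.get(v, w) != w for v, w in assign.items()):
                continue
            merged = {**partial, **assign}
            if consistent(merged):            # hypothetical algebraic test
                result = match_faces(rest, faces_b, consistent,
                                     merged, used | {j})
                if result is not None:
                    return result
    return None                               # backtrack: nothing fits


# Toy oracle: an assignment is "consistent" if each vertex maps to its
# upper-case counterpart. A real oracle would test the algebraic constraints.
toy_oracle = lambda m: all(w == v.upper() for v, w in m.items())
faces_1 = [("a", "b", "c", "d"), ("c", "d", "e", "f")]
faces_2 = [("C", "D", "E", "F"), ("B", "C", "D", "A")]
print(match_faces(faces_1, faces_2, toy_oracle))
```

Returning `None` from the outermost call corresponds to the conclusion in step 3(a) that the two images are not of the same object.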


The use of feasible constraints for object recognition is motivated
by the ACRONYM system [Brooks]; see also [Sugihara].
In ACRONYM, the model of an object is known; the consistency
between a known three-dimensional model and its projection
in an image is tested using an extension of the SUP-INF
method of reasoning about linear inequalities. The inequalities
arise because of tolerances on the expected transformation
and image feature positions. By contrast, in the work reported
here, we do not have a three-dimensional model, but only the
constraints imposed by a two-dimensional view.


2.2  Testing  Consistency  using  the  Grobner  basis 
approach 

The  consistency  of  the  two  views  can  be  established  by  deciding 
whether  algebraic  constraints  corresponding  to  the  two  views 
are consistent. These algebraic constraints are typically nonlinear.
In [Cyrluk et al], we assumed that these constraints
are nonlinear equations, which is the case for ideal images in
which image coordinates are exactly known. The experiments
demonstrated the feasibility of this approach in
such cases. A key result is that relatively few constraints are
sufficient to prove the inconsistency between two views. In
the case that the views are consistent, it is then possible to
determine the transformation between views and extend the
model to include explicit three-dimensional constraints. Later
in this paper, we also use nonlinear inequality constraints.

There  are  four  types  of  algebraic  constraints  obtained  by 
matching  vertex  groups  in  one  image  to  vertex  groups  in  the 
other  image: 


1.  Equations describing the affine projection, relating object
points to image points in terms of the projection parameters.


2.  Geometric constraints describing relationships within each
vertex group. These equational constraints restrict the vertices
in each vertex group to be coplanar.


3.  Equations describing hypotheses about possible assignments
of vertices in one image to vertices in the
other image.


4.  Trigonometric identities about the rotational angles in an
affine transformation.


In  addition,  if  the  relation  between  two  affine  projections 
is  known  such  as  in  the  case  of  stereo  matching,  then  there 
are  additional  equational  constraints  relating  them  also.  The 
details  about  how  such  equations  are  generated  are  discussed 
in  [Cyrluk  et  al]. 

Given these equational constraints, we must be able to solve
the following problems:


1.  Is a given assignment of vertex groups in two projections
consistent? 


2.  Given  a  consistent  assignment,  what  can  we  conclude 
about  the  transformation  parameters  and  depth  of  the 
object  surfaces? 


3.  Given  two  vertex  groups  with  consistent  assignments,  can 
they  be  merged  in  a  consistent  manner? 


A  Grobner  basis  algorithm  [Buchberger,  1965;  1985]  is  used 
to  check  whether  these  equations  are  consistent  or  not.  In 
case  the  system  of  equations  is  consistent,  their  Grobner  basis 
embodies  all  the  information  about  the  model  which  can  be 
extracted  from  images.  In  this  sense,  computing  a  Grobner  ba¬ 
sis  serves  the  role  of  compilation  of  available  knowledge.  This 
information  is  stored  for  subsequent  manipulation  when  equa¬ 
tions  corresponding  to  additional  constraints  are  introduced. 

The concept of a Grobner basis of a finite set of equations
was introduced by Buchberger [1965; 1970], who also
gave an algorithm for computing such bases. Grobner basis
computation  can  be  used  to  determine  solutions  of  a  system  of 
nonlinear  polynomial  equations.  In  particular  it  can  determine 
the  consistency  of  a  system  of  polynomial  equations.  For  details 
of  this  application  of  a  Grobner  basis  algorithm,  the  reader  may 
consult  [Buchberger,  1985;  Kapur;  Cyrluk  et  al].  Below,  we  give 
the  key  results  used  and  an  overview  of  the  approach. 

Let $Q$ be the field of rationals. Let $Q[x_1, \ldots, x_n]$ be the set
of all polynomials with indeterminates $x_1, \ldots, x_n$. Let $p_1 = 0,
\ldots, p_j = 0$ be a finite system of nonlinear equations in which
$p_1, \ldots, p_j$ are polynomials in $Q[x_1, \ldots, x_n]$.

Proposition: $p_1 = 0, \ldots, p_j = 0$ are not consistent, i.e.,
do not have a solution in complex numbers, if and only if a
Grobner basis of $p_1, \ldots, p_j$ includes 1.

Proposition: The set of all solutions in complex numbers
of $p_1 = 0, \ldots, p_j = 0$ is also the set of common zeros of all polynomials
of a Grobner basis of $p_1, \ldots, p_j$, and vice versa.

The main advantage of computing a Grobner basis of the
set of polynomials $p_1, \ldots, p_j$ corresponding to a consistent set
of equations $p_1 = 0, \ldots, p_j = 0$ is that using the Grobner basis,
all solutions of the equations can be systematically examined;
see [Buchberger, 1985] for details.

Polynomials  in  this  approach  are  viewed  as  rewrite  rules 
which  can  be  used  to  generate  normal  forms  for  polynomials 
with  respect  to  equational  constraints.  A  Grobner  basis  algo¬ 
rithm  is  similar  to  the  Knuth-Bendix  completion  procedure  in 
which  additional  polynomials  are  generated  from  a  given  set 
of  polynomials  until  it  is  possible  to  generate  canonical  forms 
(unique  normal  forms)  for  polynomials  with  respect  to  equa¬ 
tional  constraints.  The  key  inference  step  (apart  from  rewrit¬ 
ing)  is  to  generate  a  critical  pair  from  two  polynomials  as  illus¬ 
trated  by  the  following  example: 

Given two polynomials $x^2 y - 3$ and $x y^2 - 1$, the least common
multiple of their head monomials can be computed, which
is called the superposition or overlap. In this case, the superposition
is $x^2 y^2$. Now, either of the above two polynomials can
be used to rewrite the superposition. Rewriting with the first
one gives $3y$ as the result, whereas rewriting with the second
one produces $x$. The pair $\langle 3y, x \rangle$ is called the critical pair of
the above polynomials. The equation $3y - x = 0$ follows from
the equations $x^2 y - 3 = 0$ and $x y^2 - 1 = 0$. A critical pair is
further rewritten until no more rewriting is possible. If the normal
forms of the two polynomials in a critical pair are not the same,
then a new polynomial is obtained by taking the difference of
the two normal forms; this new polynomial is added to
the original set. This process of generating new polynomials
is continued until no new polynomials can be generated. This
process is guaranteed to terminate because of Hilbert's Basis
Condition, a combinatorial property of ideals over Noetherian
rings [van der Waerden]. If the polynomial 1 (in fact, any nonzero
rational number) is generated in this process, this indicates that the
original set of equations is inconsistent; otherwise, the final set
of polynomials is a Grobner basis of the input set of polynomials.
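
The completion process just described can be illustrated with a minimal sketch. It is not the GEOMETER implementation: the dictionary representation of polynomials, the fixed lexicographic ordering, and all function names are assumptions made for illustration, and none of the critical-pair optimizations mentioned in this paper are included. The example polynomials are the two used in the text, followed by a small inconsistent pair chosen for demonstration.

```python
from fractions import Fraction
from itertools import combinations

# A polynomial is a dict {exponent tuple: Fraction}; with variables (x, y),
# the key (2, 1) stands for the monomial x^2*y.

def poly(terms):
    return {m: Fraction(c) for m, c in terms.items()}

def add(p, q, c=1):
    """Return p + c*q with zero terms dropped."""
    r = dict(p)
    for m, a in q.items():
        v = r.get(m, 0) + c * a
        if v:
            r[m] = v
        else:
            r.pop(m, None)
    return r

def times(p, m, c=1):
    """Multiply p by the single term c * x^m."""
    return {tuple(i + j for i, j in zip(mon, m)): c * a for mon, a in p.items()}

def head(p):
    """Head monomial and its coefficient under lexicographic order."""
    m = max(p)
    return m, p[m]

def s_poly(f, g):
    """Cancel the head terms of f and g at the lcm of their head monomials."""
    (fm, fc), (gm, gc) = head(f), head(g)
    lcm = tuple(max(i, j) for i, j in zip(fm, gm))
    f_shift = tuple(i - j for i, j in zip(lcm, fm))
    g_shift = tuple(i - j for i, j in zip(lcm, gm))
    return add(times(f, f_shift, 1 / fc), times(g, g_shift, 1 / gc), -1)

def normal_form(p, basis):
    """Rewrite p with the basis polynomials until no term can be rewritten."""
    p, rest = dict(p), {}
    while p:
        m, c = head(p)
        for g in basis:
            gm, gc = head(g)
            if all(i >= j for i, j in zip(m, gm)):   # head of g divides m
                shift = tuple(i - j for i, j in zip(m, gm))
                p = add(p, times(g, shift), -c / gc)
                break
        else:                                        # term is irreducible
            rest[m] = p.pop(m)
    return rest

def buchberger(polys):
    """Add normal forms of critical pairs until none is nonzero."""
    basis = [p for p in polys if p]
    pairs = list(combinations(range(len(basis)), 2))
    while pairs:
        i, j = pairs.pop()
        h = normal_form(s_poly(basis[i], basis[j]), basis)
        if h:
            pairs.extend((k, len(basis)) for k in range(len(basis)))
            basis.append(h)
    return basis

def inconsistent(polys):
    """True if a nonzero constant appears in the computed basis."""
    return any(not any(head(g)[0]) for g in buchberger(polys))

f = poly({(2, 1): 1, (0, 0): -3})      # x^2*y - 3
g = poly({(1, 2): 1, (0, 0): -1})      # x*y^2 - 1
print(s_poly(f, g))                    # the polynomial x - 3y
print(inconsistent([f, g]))
print(inconsistent([poly({(1, 1): 1, (0, 0): -1}), poly({(1, 0): 1})]))  # x*y - 1 and x
```

For the two polynomials of the text, `s_poly` yields $x - 3y$, which agrees with the equation $3y - x = 0$ up to sign. The second system, $xy - 1 = 0$ together with $x = 0$, produces a constant and is therefore reported as inconsistent.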

For  further  details,  the  reader  may  consult  [Buchberger, 
1985;  Cyrluk  et  al]. 

2.3  Experiments 

As  reported  in  [Cyrluk  et  al],  an  implementation  of  a  Grobner 
basis  algorithm  was  developed  as  a  geometric  reasoning  sys¬ 
tem  GEOMETER  to  prove  geometry  theorems  [Kapur].  This 
implementation was subsequently modified for the view consistency
approach. In [Cyrluk et al], we reported on preliminary
experiments illustrating the view consistency approach
on  a  cube,  both  when  the  relationship  between  projections  is 
known  as  well  as  when  the  relationship  between  projections  is 
not  known.  Since  then,  we  have  made  many  changes  to  our 
implementation,  in  particular  incorporating  the  optimizations 
suggested  in  [Buchberger,  1979]  as  well  as  in  [Kapur,  Musser 
and  Narendran,  1985]  that  reduce  the  number  of  critical  pairs 
that  have  to  be  considered.  We  also  modified  the  completion 
procedure  to  be  patterned  after  a  Knuth-Bendix  completion 
procedure  as  implemented  in  RRL,  a  rewrite  rule  laboratory,  a 
theorem prover based on rewriting techniques [Kapur, Sivakumar
and Zhang].

Another key modification that helped improve the performance
was to precompute some commonly occurring equational
constraints that can be derived from the properties of the affine
projections. As the following tables, which contrast the current
results with those reported at the 1987 workshop, illustrate,
these changes led to significant improvements in
performance (Figure 1). All experiments were done on a Symbolics
3600 LISP machine. The following table gives the timings for
the case when the relation between the projections is known.


Problem          At 1987 Workshop    Currently
Depth            272 sec             20 - 30 sec
Inconsistency    48 sec              10 sec

The  following  table  gives  the  timings  for  the  case  when  the 
relation  between  the  projections  is  not  known. 


Problem                                  At 1987 Workshop    Currently
Assigning one face in one image to
another face in another image            1 hour              3 - 4 minutes
Inconsistent extended assignment         > 4 hours           7 - 10 minutes
Consistent extended assignment           > 10 hours          20 sec - 5 minutes

Projection 1: x-axis: 45, y-axis: -35.26, z-axis: -30
Projection 2: x-axis: -45, y-axis: -35.26, z-axis: 30

Figure  1:  Two  projections  of  a  Cube. 

We  recently  tried  the  view  consistency  approach  on  a  more 
complex  polyhedral  object  (Figure  2).  The  results  were  ex¬ 
tremely  encouraging  when  the  relation  between  projections  is 
known.  Our  implementation  could  deduce  depth  parameters 
in less than a minute. However, when the relation between
the projections is not known, the method was still
able to deduce the depth parameters as well as the transformation
parameters, but it took considerably longer.


Figure 2: Block.

3  Reasoning about Inequalities

Two  serious  limitations  of  the  use  of  the  Grobner  basis  algo¬ 
rithm  for  the  view  consistency  problem  are  (i)  the  requirement 
that  all  the  information  should  be  specified  in  exact  form,  and 
(ii)  the  inability  to  deal  with  inequalities  which  can  be  used 
to express inexact data. For instance, if a coordinate position
in an ideal projection is $\sqrt{20}$, this must be specified implicitly
using an equation $x^2 = 20$; further, it is not possible to rule
out the negative value that $x$ could take. However, the value
of $x$ could be given to any desired precision using inequalities.
Inequalities  can  also  be  used  to  describe  parameterized  models 
as  discussed  later;  see  also  [Brooks]. 

A number of techniques have been proposed in the literature
for reasoning about inequalities. In the case of linear
inequalities, the well-known SIMPLEX algorithm for solving
linear programming problems and the polynomial time algorithms
of Khachiyan and Karmarkar can be used for large problems.
However, for medium-sized problems, especially those arising in program
verification applications, the SUP-INF method proposed
by Bledsoe [1975] and extended by Shostak [1977] has been
more widely cited; see also [Shostak, 1981]. An advantage of the
SUP-INF method is that it can be easily implemented without
much overhead.

Any  decision  procedure  for  the  theory  of  real  closed  fields 
(also  known  as  the  theory  of  elementary  algebra  and  geome¬ 
try),  such  as  Tarski’s  method,  is  a  complete  decision  procedure 
for  determining  the  inconsistency  of  nonlinear  inequalities  over 
the  reals.  There  exist  many  complete  methods  for  the  theory 
of  real  closed  fields.  The  method  of  cylindrical  algebraic  de¬ 
composition  due  to  Collins  has  been  implemented  as  a  part  of 
the  SAC-2  system  [Arnon  et  al].  Unfortunately,  this  implemen¬ 
tation  needs  considerable  computational  resources  for  solving 
even  simple  problems. 

The  SUP-INF  method  was  extended  by  Brooks  in  a 
straightforward  way  to  consider  non-linear  inequalities  and 
trigonometric  functions.  The  extended  method  is  not  complete; 
actually,  the  subclass  of  non-linear  inequalities  for  which  the 
method  works  is  not  even  known.  Brooks  used  the  extended 
method  in  the  ACRONYM  system  for  model-based  recogni¬ 
tion.  We  have  implemented  this  algorithm  in  GEOMETER 
and experimented with it on a number of simple examples; we
left out the trigonometric component of the algorithm since we
treat $\sin(\theta)$ and $\cos(\theta)$ as variables instead of functions of $\theta$.
We relate them with the equation $\sin^2(\theta) + \cos^2(\theta) = 1$.

Our implementation integrates Brooks' SUP-INF method
with the Grobner basis algorithm so that equational constraints
are solved using the Grobner basis algorithm and inequality
constraints  are  manipulated  using  Brooks’  algorithm.  This  in¬ 
tegration  of  the  Grobner  basis  algorithm  with  Brooks’  SUP- 
INF  method  will,  henceforth,  be  referred  to  as  the  combined 
method.  The  combined  method  suffers  from  deficiencies  similar 
to  those  of  the  SUP-INF  method.  Below,  we  discuss  a  num¬ 
ber  of  improvements  made  to  the  combined  method  and  show 
that  the  combined  method  not  only  works  better  than  Brooks’ 
SUP-INF  method  but  also  extends  the  domain  of  applicability 
of  the  SUP-INF  method. 

First,  we  give  an  overview  of  the  original  SUP-INF  method 
for  linear  inequalities,  and  then  discuss  Brooks’  extension  for 
the  nonlinear  case.  This  is  followed  by  a  discussion  of  our 
improvements  and  extensions. 


Bledsoe introduced the SUP-INF method to prove universally
quantified formulas in the theory of Presburger arithmetic
[Bledsoe]. This method was subsequently extended by Shostak
[1977].

The input to the method is a conjunction of linear inequalities
with rational coefficients. An equality constraint $p_i = 0$ in
the method is transformed into a conjunction of two inequalities
$p_i \le 0$ and $p_i \ge 0$. The goal is to decide whether these
inequalities can be satisfied. If so, the method produces lower
and upper bounds for each variable appearing in the set of inequalities.
A set of inequalities is unsatisfiable if and only if
there is a variable whose lower bound is greater than its
upper bound.

The method is based on transforming inequalities such that
for each variable $x$, they can be expressed as $x \le ub_i$ or $x \ge lb_i$,
where $ub_i$ and $lb_i$ are, in general, linear expressions in terms of
the rest of the variables. An upper bound for $x$, SUP($x$), is
then the minimum over the $ub_i$'s, whereas a lower bound for $x$,
INF($x$), is the maximum over the $lb_i$'s. To compute the minimum
of the $ub_i$'s, say, the algorithm is called recursively on each $ub_i$ in
an attempt to compute the lower and upper bounds on $ub_i$ in
terms of the variable $x$. Finally, linear equations in terms of
$x$ are obtained for these bounds, which can be solved. A dual
technique is used for computing the lower bounds. In this way,
the algorithm computes rational upper and lower bounds for
each variable.
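
As a small illustration of these definitions, consider the constraint set $x \le y + 1$, $x \le 5$, $x \ge 2$, $y \le 3$, $y \ge 0$. This set is made up for the purpose of illustration and is not taken from the experiments. The bounds work out as follows:

```latex
\begin{align*}
\mathrm{SUP}(y) &= 3, \\
\mathrm{SUP}(x) &= \min\bigl(\mathrm{SUP}(y) + 1,\ 5\bigr) = \min(4,\ 5) = 4, \\
\mathrm{INF}(x) &= 2, \\
\mathrm{INF}(y) &= \max\bigl(0,\ \mathrm{INF}(x) - 1\bigr) = \max(0,\ 1) = 1.
\end{align*}
```

Hence $2 \le x \le 4$ and $1 \le y \le 3$. No lower bound exceeds the corresponding upper bound, so the set is satisfiable.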

Shostak proved that these bounds are indeed tight. In order
to decide whether a given set of inequalities has an integer
solution, Shostak suggested further improvements to Bledsoe's
method.  If  the  inequalities  do  not  have  a  real  solution,  then 
they  do  not  have  an  integer  solution  either.  However,  if  they 
do  have  a  real  solution,  i.e.,  the  method  produces  satisfiable 
intervals  of  upper  and  lower  bounds  for  each  variable,  it  can 
be  checked  whether  each  interval  has  an  integer.  If  for  some 
variable,  its  interval  does  not  include  an  integer,  the  inequalities 
do  not  have  an  integer  solution.  Otherwise,  the  procedure  has 
to  search  over  integers  in  the  satisfiable  interval  of  each  variable. 
The  improvements  reported  below  to  the  SUP-INF  method  and 
the  combined  method  should  also  work  for  integer  solutions. 
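
The interval test for integer solutions described above amounts to a one-line check per variable. The sketch below assumes closed intervals with rational endpoints; the function names and the dictionary of intervals are hypothetical. As noted in the text, passing the test is only a necessary condition, after which a search over integers is still required.

```python
import math
from fractions import Fraction


def contains_integer(lo, hi):
    """True if the closed interval [lo, hi] contains at least one integer."""
    return math.ceil(lo) <= math.floor(hi)


def may_have_integer_solution(intervals):
    """Necessary test only: every variable's interval must contain an integer."""
    return all(contains_integer(lo, hi) for lo, hi in intervals.values())


bounds = {
    "x": (Fraction(1, 2), Fraction(5, 2)),   # contains 1 and 2
    "y": (Fraction(1, 3), Fraction(2, 3)),   # contains no integer
}
print(may_have_integer_solution(bounds))
```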

3.2  Brooks’  extension 

Brooks  extended  Bledsoe-Shostak’s  SUP-INF  method  to  be  ap¬ 
plicable  to  nonlinear  inequalities  including  trigonometric  func¬ 
tions.  For  handling  nonlinear  terms  in  an  expression,  the  ex¬ 
tended  algorithm  attempts  to  determine  the  parity  (whether 
the  value  is  always  positive,  zero  or  negative)  of  variables.  Par¬ 
ity  constraints  from  nonlinear  terms  are  propagated  to  gener¬ 
ate  parity  constraints  on  variables,  which  in  turn,  constrain  the 
parity of other nonlinear terms involving these variables. For
instance, if the parity of $xy$ is known to be positive, then $x$
and $y$ will have the same parity: either both are positive or
both are negative. The method is limited, however, since for
computing lower and upper bounds of nonlinear expressions, it
deals with each nonlinear term separately without constraining
other terms in which common variables appear. For instance,
as also pointed out by [Sacks], for computing an upper bound
of $x^2 - x$ when $x$ is unconstrained, an upper bound for $x^2$ and
a lower bound for $x$ are computed independently, and the same
value of $x$ is not used for determining bounds of $x^2 - x$.
Despite these limitations, Brooks found this extension quite
adequate for his work in model-based recognition.
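
The effect of bounding each term separately can be seen in a small numerical sketch. It is not Brooks' algorithm: it simply contrasts termwise bounds on $x^2 - x$ with the range obtained when the same value of $x$ is used in both terms. The interval $0 \le x \le 1$ is an assumption made so that both quantities are finite, and the function names are hypothetical.

```python
from fractions import Fraction


def square_bounds(lo, hi):
    """Range of x^2 for x in [lo, hi]."""
    if lo <= 0 <= hi:
        return 0, max(lo * lo, hi * hi)
    return min(lo * lo, hi * hi), max(lo * lo, hi * hi)


def termwise_bounds(lo, hi):
    """Bounds on x^2 - x with each term bounded on its own."""
    sq_lo, sq_hi = square_bounds(lo, hi)
    # INF(x^2) - SUP(x) and SUP(x^2) - INF(x)
    return sq_lo - hi, sq_hi - lo


def sampled_range(lo, hi, steps=100):
    """Range of x^2 - x on an even grid, one value of x per sample."""
    xs = [lo + (hi - lo) * Fraction(k, steps) for k in range(steps + 1)]
    values = [x * x - x for x in xs]
    return min(values), max(values)


lo, hi = Fraction(0), Fraction(1)
print(termwise_bounds(lo, hi))   # termwise interval
print(sampled_range(lo, hi))     # range with a shared value of x
```

On this interval the termwise bounds are $-1$ and $1$, while the expression itself only ranges between $-1/4$ and $0$.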

4  Combining  Grobner  basis  algorithm 
and  SUP-INF  method 

We have implemented Brooks' extended SUP-INF method in
GEOMETER and integrated it with the Grobner basis algorithm.
The  integration  is  not  completely  done  yet,  as  some  of  the 
steps  are  currently  performed  manually.  The  basic  idea  is  to 
first  manipulate  equality  constraints  using  the  Grobner  basis 
algorithm,  possibly  deducing  additional  equality  constraints. 
Equality  constraints  are  then  used  as  rewrite  rules  to  simplify 
the  inequality  constraints.  All  constraints  (input  as  well  as  de¬ 
duced)  are  then  transformed  into  inequalities  and  the  SUP-INF 
method  is  invoked.  If  any  new  equality  constraint  is  detected 
(when  the  satisfiable  interval  for  some  variable  or  term  includes 
only  one  value,  i.e.,  its  upper  bound  and  lower  bound  are  iden¬ 
tical),  then  that  equality  constraint  is  further  propagated  using 
rewriting  and  the  critical  pair  computation  in  the  Grobner  basis 
algorithm.  Note  that  this  method  will  only  determine  whether 
a  variable  or  a  term  is  equal  to  some  rational  value,  and  it 
will  not  determine  whether  a  nonlinear  expression  is  equal  to  a 
rational  value. 

Another modification made to Brooks' extended SUP-INF algorithm
is to specify a subset of the variables as input variables.
When SUP and INF are called on these variables, no extra
computation takes place; rather, the user-provided bounds are
used. This modification is especially useful in our parameterized
model matching problems, where many variables are
involved. Below, we discuss some additional extensions to the
SUP-INF method in our integration which make it more effective.

4.1  Explicit  use  of  equality  constraints 

As  stated  earlier,  the  SUP-INF  method  transforms  equality 
constraints  into  equivalent  inequality  constraints.  As  a  result, 
it  does  not  make  direct  use  of  equality  constraints  for  simpli¬ 
fying  inequality  constraints.  As  illustrated  by  the  following 
example,  using  equality  constraints  to  simplify  inequality  con¬ 
straints  considerably  improves  the  performance  of  the  method. 
Given  a  set  of  linear  constraints, 

$3x_1 - 2x_2 - 7x_3 = 0$
$10x_3 - 8x_1 - 7x_2 = 0$
$5x_3 > (2x_2 + 3)$
$x_3 > (x_4 + x_2)$

$5 > 3x_1$
$3x_2 > -(x_1 + 4x_4)$
$2x_3 > 5x_4$
$4x_5 > -(x_2 + x_1)$
$2x_3 > 7x_5$

The SUP-INF method takes nearly 8.2 seconds to compute
the satisfiable intervals:

$0.87341774 < x_1 < 1.6666666$,

$-0.62801933 < x_2 < -0.32911393$,

$0.46835443 < x_3 < 0.8937198$,

$0.028481012 < x_4 < 0.35748792$,

$-0.25966182 < x_5 < 0.2553485$

If the equality constraints are manipulated using the
Grobner basis algorithm (or any other method for solving linear
equations such as Gaussian elimination) and the normalized
equalities are used to rewrite the inequality constraints, we get
the result in a total time of less than 1 second. Using the
Grobner basis algorithm, from the equality constraints, it is
first determined that

$x_1 = (69/37)\,x_3$ and $x_2 = -(26/37)\,x_3$.

These two equalities are used to eliminate $x_1$ and $x_2$ from the
inequality constraints. The improvement in performance is obtained
because the SUP-INF method has to consider fewer
variables in this case.
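
The elimination step can be checked with exact rational arithmetic. The sketch below is not the SUP-INF method: it solves the two equalities for $x_1$ and $x_2$ by Cramer's rule and then reads the bounds directly off the rewritten inequalities. Variable names are chosen for illustration. The constraint on $x_3$ and a sum involving $x_2$ (fourth in the list) is left out on the assumption that it is not binding; the intervals reported above are reproduced without it.

```python
from fractions import Fraction as F

# Equalities rewritten as 3*x1 - 2*x2 = 7*x3 and 8*x1 + 7*x2 = 10*x3.
# Cramer's rule gives x1 = k1*x3 and x2 = k2*x3.
det = F(3) * 7 - F(-2) * 8
k1 = (F(7) * 7 - F(-2) * 10) / det
k2 = (F(3) * 10 - F(7) * 8) / det

# After substitution, two inequalities bound x3 on its own.
x3_lo = F(3) / (5 - 2 * k2)      # from 5*x3 > 2*x2 + 3
x3_hi = F(5) / (3 * k1)          # from 5 > 3*x1


def scaled(k, lo, hi):
    """Interval of k*x for x in [lo, hi]."""
    return tuple(sorted((k * lo, k * hi)))


x1 = scaled(k1, x3_lo, x3_hi)
x2 = scaled(k2, x3_lo, x3_hi)
# 3*x2 > -(x1 + 4*x4) gives x4 > -(k1 + 3*k2)/4 * x3 (positive factor);
# 2*x3 > 5*x4 gives x4 < (2/5)*x3.
x4 = (-(k1 + 3 * k2) / 4 * x3_lo, F(2, 5) * x3_hi)
# 4*x5 > -(x2 + x1) gives x5 > -(k1 + k2)/4 * x3 (negative factor);
# 2*x3 > 7*x5 gives x5 < (2/7)*x3.
x5 = (-(k1 + k2) / 4 * x3_hi, F(2, 7) * x3_hi)

for name, (lo, hi) in [("x1", x1), ("x2", x2), ("x3", (x3_lo, x3_hi)),
                       ("x4", x4), ("x5", x5)]:
    print(name, float(lo), float(hi))
```

The exact endpoints, for example $37/79$ and $185/207$ for $x_3$, agree with the decimal intervals listed above.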

We  have  recently  come  to  know  that  Kaufl  [1986]  made 
similar  observations  about  the  SUP-INF  method  while  inves¬ 
tigating  its  use  in  program  verification  and  proposed  the  use 
of  equality  constraints  as  rewrite  rules  for  eliminating  variables 
and  simplifying  inequality  constraints.  Kaufl  proposed  a  tech¬ 
nique  for  determining  additional  linear  equalities  from  linear 
inequalities  by  introducing  new  variables;  we  have  not  investi¬ 
gated  the  effectiveness  of  his  method.  However,  it  is  not  clear 
how  this  method  will  help  in  the  presence  of  nonlinear  inequal¬ 
ities. 

The  SUP-INF  method  produces  the  same  results  no  matter 
in  what  order  the  bounds  for  variables  are  computed.  However, 
it  does  seem  that  the  order  in  which  the  bounds  are  computed 
affects  the  computational  performance  quite  a  bit.  The  method 
seems  to  implicitly  consider  all  possible  orderings  on  variables 
to  compute  bounds;  this  happens  when  each  variable  in  an 
inequality  is  solved  by  moving  the  rest  of  the  expression  in 
the inequality onto the other side. For example, an inequality
constraint $x_1 + x_2 - x_3 > 0$ is reformulated as $x_1 > (-x_2 + x_3)$
for computing bounds on $x_1$, as $x_2 > (-x_1 + x_3)$ for computing
bounds on $x_2$, and $x_3 < (x_1 + x_2)$ for computing bounds on $x_3$.

4.2  Storing  intermediate  results 

As  observed  by  Brooks,  the  extended  SUP-INF  method  is 
highly  recursive.  We  observed  while  running  several  exam¬ 
ples  that  many  computations  of  upper  and  lower  bounds  of 
variables  expressed  using  a  given  set  of  variables  (the  result  of 
SUPP  and  INFF)  were  being  repeated.  As  a  result,  we  decided 
to  store  intermediate  results;  for  each  variable  or  a  term,  we 
store  its  upper-bound  and  lower-bound  as  expressions  in  terms 
of  a  given  set  of  variables.  Thus,  whenever  bounds  on  vari¬ 
ables  need  to  be  computed,  it  is  first  checked  whether  bounds 
expressed  in  a  given  set  of  variables  are  already  available;  if 
so, they are retrieved; they are computed only if they are not
available. It is possible to further improve this feature. For
example,  if  a  bound  of  a  variable  in  terms  of  a  set  of  variables 
is  needed,  it  is  sufficient  to  start  with  its  bound  in  terms  of  a 
superset  if  that  is  available,  and  use  that  bound  to  compute  a 
better  bound  in  terms  of  the  required  subset.  We  have  so  far 
implemented  the  simplest  memory  feature. 
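
The benefit of storing intermediate results can be seen on a toy problem. The sketch below is a simplification and not the SUPP/INFF machinery: each constraint has the form "variable at most a constant" or "variable at most another variable", the constraints are assumed to be acyclic, and results are stored per variable only (the memory feature described above also records the set of variables in which a bound is expressed). All names are hypothetical.

```python
def make_sup(upper_bounds, use_memo):
    """Return (sup, stats); sup(v) is the least upper bound implied for v.

    upper_bounds maps a variable to a list of bounds, each either a number
    or the name of another variable (meaning v <= that variable).
    """
    memo, stats = {}, {"evaluations": 0}

    def sup(v):
        if use_memo and v in memo:
            return memo[v]                    # stored result, no recomputation
        stats["evaluations"] += 1
        result = min(b if isinstance(b, (int, float)) else sup(b)
                     for b in upper_bounds[v])
        memo[v] = result
        return result

    return sup, stats


def layered_constraints(layers):
    """Each of a_i, b_i is bounded by both a_(i+1) and b_(i+1);
    the last layer is bounded by constants."""
    ub = {}
    for i in range(layers - 1):
        for name in ("a", "b"):
            ub[f"{name}{i}"] = [f"a{i + 1}", f"b{i + 1}"]
    ub[f"a{layers - 1}"] = [7]
    ub[f"b{layers - 1}"] = [5]
    return ub


ub = layered_constraints(10)
plain, plain_stats = make_sup(ub, use_memo=False)
stored, stored_stats = make_sup(ub, use_memo=True)
print(plain("a0"), plain_stats["evaluations"])
print(stored("a0"), stored_stats["evaluations"])
```

With ten layers, the version without stored results evaluates a bound 1023 times, against 19 times with them, which is in line with the observation that the savings grow quickly with problem size.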

For a modified version of the example discussed in the previous
section, in which the first two equality constraints were
changed to inequality constraints by replacing = with >, the
SUP-INF method took over 16 seconds without the memory
feature. With the memory feature, however, only 4 seconds
were needed to compute the bounds. Similar improvements in
performance were observed on other examples. The improvement
in performance in fact appears to increase nonlinearly
with the size of an example; the improvement is dramatic for
larger examples. For the second example discussed in the section
below on imprecise data and parameterized model, the
SUP-INF method with the memory feature took 20 minutes,
whereas the version without the memory feature did not finish
overnight.

4.3  Multiple  runs  of  SUP-INF  method  for  nonlinear  inequalities

In  the  case  of  nonlinear  inequality  constraints,  the  SUP-INF 
method  relies  heavily  on  determining  the  parity  of  variables. 
The order in which variables are processed becomes important
because, while determining bounds on a variable x, the parity of
some other variables may not be known. Those parities may,
however, be determined subsequently, and from them it may be
possible to obtain better bounds on x. To get those better
bounds, the SUP-INF method may need to be run
more than once. Examples can be constructed in which the
SUP-INF method has to be run a number of times
proportional to the number of variables in order to obtain the best
possible bounds in the presence of nonlinear constraints.

The first example below illustrates that two runs may be needed
to get the best bounds; a second example then illustrates the
need for three runs. In the first example, variable x is processed
first, followed by y and then z.

x > 2, y > 1, z < 9,

2z > 3x, yz < 6

After  the  first  run,  the  SUP-INF  method  returns  the  fol¬ 
lowing  bounds: 

2 < x < 4, 1 < y < ∞, 3 < z < 6.

While  computing  an  upper  bound  for  y,  it  does  not  know 
the  parity  of  z.  However,  after  the  parity  of  z  is  known,  as 
is  the  case  in  the  second  run,  a  better  upper  bound  for  y  is 
computed.  The  result  after  the  second  run  is: 

2 < x < 4, 1 < y < 2, 3 < z < 6.
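
The dependence on parity can be illustrated with a much simpler procedure than SUP-INF. The sketch below is plain interval propagation, not Brooks' extension: it treats the inequalities as non-strict, sweeps over x, y, z in that order, and uses the constraint yz < 6 to bound y only once z is known to be positive. Because it lacks the recursion of SUP-INF, its intermediate values need not coincide with those reported above; the function name `one_pass` is ours.

```python
import math

def one_pass(b):
    """One sweep over x, y, z (in that order) for the constraints
    x >= 2, y >= 1, z <= 9, 2z >= 3x, yz <= 6."""
    b = {v: list(iv) for v, iv in b.items()}
    # x <= 2z/3 needs only an upper bound on z.
    b["x"][1] = min(b["x"][1], 2 * b["z"][1] / 3)
    # y <= 6/z is usable only once z is known to be positive.
    if b["z"][0] > 0:
        b["y"][1] = min(b["y"][1], 6 / b["z"][0])
    # z >= 3x/2, and z <= 6/y once y is known to be positive.
    b["z"][0] = max(b["z"][0], 3 * b["x"][0] / 2)
    if b["y"][0] > 0:
        b["z"][1] = min(b["z"][1], 6 / b["y"][0])
    return b

start = {"x": [2, math.inf], "y": [1, math.inf], "z": [-math.inf, 9]}
first = one_pass(start)    # y is still unbounded above: sign of z unknown
second = one_pass(first)   # z > 0 is now known, so y <= 6/3 = 2
```

In this sketch the upper bound on y stays infinite after the first sweep and drops to 2 in the second, which is the same effect the two runs of the SUP-INF method show.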

Similarly,  for  the  following  example  in  which  the  variables 
are  processed  in  the  order,  x  first,  y,  z  and  lastly  u,  three  runs 
are  needed  to  get  the  best  bounds: 

x > 12, y > 1, z > 0, u < 3,

2zu  >  3x,  yzu  <  20,  (yz  +  zu)  >  12 

(yz  +  zu)  <  40 

After  the  first  run,  the  following  bounds  are  obtained: 

12 < x < 44, 1 < y < ∞,

6 < z < ∞, 0 < u < 3.

After the second run, the bounds are

12 < x < 40/3, 1 < y < 10/9,

6 < z < 40, 9/11 < u < 3.

And, finally, the third run gives the best bounds,

12 < x < 40/3, 1 < y < 10/9,

6 < z < 220/9, 9/11 < u < 3.

This example also indicates that the order in which variables
are processed becomes extremely important if we wish
to reduce the number of runs needed to obtain the best bounds.

Even  for  an  unsatisfiable  set  of  nonlinear  constraints,  multi¬ 
ple  runs  of  the  SUP-INF  method  may  be  required  for  detecting 
inconsistency.  The  following  example,  which  is  a  modification 
of  the  above  example,  illustrates  this. 

x > 12, y > 1, z < -7, u < 3,

2zu > 3x, yzu < 12, (yz + zu) < -4

After the first run, the bounds obtained are:

12 < x < ∞, 1 < y < ∞,

-∞ < z < -33, -12/7 < u < 0.

And, in the next run, the inconsistency is detected.


It  is  not  clear  whether  for  each  problem,  there  is  a  suitable 
order  in  which  variables  can  be  processed  to  get  the  best  results 
from  Brooks’  extension  in  a  single  run. 

4.4  Splitting  intervals  or  case  analysis 

As  the  above  examples  illustrate,  in  the  case  of  nonlinear  in¬ 
equalities,  the  SUP-INF  method  may  not  generate  good  bounds 
on variables in a single run. It also turns out that there are
examples for which better bounds exist (in the form of more
than one interval), but the SUP-INF method cannot find them.
In such cases, even multiple runs may not be helpful. Consider
the following example:

x > -1, y > 1, 3 > z, xy = 4x, yz = 2z.

When the Gröbner basis algorithm is run first to process
equality constraints, an additional equality constraint xz = 0
is generated. From these constraints, the SUP-INF method
cannot really get any good bounds; it produces the following
results:

-1 < x < ∞, 1 < y < ∞, and -∞ < z < 3

However,  better  bounds  can  be  obtained  if  the  above  inter¬ 
val  for  each  variable  is  further  split  into  subintervals.  We  have 
incorporated  this  into  the  combined  method.  So  if  xz  =  0, 
then  case  analysis  can  be  performed  in  a  way  similar  to  natu¬ 
ral  deduction  method  or  semantic  tableaux  method  for  theorem 
proving  in  propositional  calculus,  and  the  problem  is  split  into 
two  subproblems  -  corresponding  to  x  =  0  or  z  =  0. 

For the case when x ≠ 0, y = 4 and z = 0. If x = 0,
however, and if z ≠ 0, then y = 2. For the case when x = 0
and z = 0, 1 < y < ∞.
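
The case analysis for this example can be written out as a small Python sketch. It is specific to the constraints xy = 4x, yz = 2z and y > 1 (taken here as non-strict), and the function `y_range` is a hypothetical helper rather than part of the combined method.

```python
import math

def y_range(x_is_zero, z_is_zero):
    """Bounds on y implied by xy = 4x, yz = 2z and y >= 1 in one case.

    Returns (lo, hi), or None when the case is infeasible.
    """
    lo, hi = 1, math.inf
    if not x_is_zero:            # x(y - 4) = 0 with x != 0 forces y = 4
        lo, hi = max(lo, 4), min(hi, 4)
    if not z_is_zero:            # z(y - 2) = 0 with z != 0 forces y = 2
        lo, hi = max(lo, 2), min(hi, 2)
    return (lo, hi) if lo <= hi else None

cases = {(xz, zz): y_range(xz, zz)
         for xz in (True, False) for zz in (True, False)}
```

The case in which both x and z are nonzero comes out infeasible, in agreement with the derived constraint xz = 0; the remaining three cases give the subintervals for y listed above.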

Splitting  a  satisfiable  interval  of  a  variable  is  often  helpful 
in  refining  the  bounds  of  variables  as  there  can  be  subintervals 
for  variables  over  which  the  constraints  may  not  be  satisfiable. 

Performing case analysis (or splitting) based on parity usually
reduces the number of times the combined method may
need to be run in order to detect the inconsistency or get good
bounds.

We  have  so  far  experimented  with  case  analysis  on  variables; 
it  is  possible  to  extend  that  analysis  to  expressions  or  terms. 

5  Imprecise  Data  and  Parameterized 
Models 

The  main  motivation  for  implementing  Brooks’  method  for  rea¬ 
soning  about  inequalities  is  to  handle  errors  in  image  coordi¬ 
nates  as  well  as  to  have  the  ability  to  specify  partial  information 
about  models.  An  example  of  the  latter  case  is  that  of  a  par¬ 
tial  three  dimensional  model  of  some  object  where  some  of  the 
vertices  are  not  explicitly  specified,  but  rather  constrained  by 
a  set  of  relationships  on  model  parameters.  A  typical  example 
could  be  a  building  with  unknown  height,  but  known  base. 

Consider, for a simple example, a rectangular face with unknown
height and width, but with bounded area. If this face
lies in the plane z = 10, then the following equations define this
parameterized model:


x1 = -w,  y1 = -h,  z1 = 10,

x2 = w,   y2 = -h,  z2 = 10,

x3 = -w,  y3 = h,   z3 = 10,

x4 = w,   y4 = h,   z4 = 10,

90  <  hw  <  110 

We  would  like  to  find  whether  a  given  image  can  possibly 
be  an  image  of  an  instance  of  the  parameterized  model.  Fur¬ 
thermore,  if  this  is  indeed  an  image  of  the  model,  we  want  to 
determine  the  transformation  from  the  model  coordinate  sys¬ 
tem  to  the  image  coordinate  system  as  well  as  the  unknown 
parameters  h  and  w.  In  the  general  case,  we  will  not  be  able  to 
completely  determine  these  unknowns,  but  only  further  refine 
them,  just  as  in  the  view  consistency  problem,  a  single  assign¬ 
ment  between  vertex  groups  did  not  completely  determine  the 
vertex  depths. 

In  the  experiments  that  we  performed  the  image  coordi¬ 
nates  were: 

u1 = -176707/12500,  v1 = -204131/25000,

u2 = -353301587/25000000,  v2 = 815939/100000,

u3 = 26413/25000000,  v3 = -61/100000,

u4 = -43/12500,  v4 = -408131/25000.

These coordinates result when h = 10 and w = 10 in the
above model and we rotate the model by 45 degrees around
the x-axis, -35.26 degrees around the y-axis, and -30 degrees
around the z-axis.

However,  a  priori  we  do  not  know  the  above  exact  coor¬ 
dinates,  but  only  approximations  to  them.  So  we  represented 
them  as  lying  in  intervals  bounded  by  two  rational  numbers. 
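
As an illustration of this representation, the sketch below builds a rational interval around one of the exact coordinates using Python's `fractions` module. The tolerance of 1/100 is a hypothetical value chosen for the example; the interval widths actually used in the experiments are not stated here.

```python
from fractions import Fraction

def interval(value, tol):
    """Closed interval [value - tol, value + tol] with rational endpoints."""
    value, tol = Fraction(value), Fraction(tol)
    return (value - tol, value + tol)

u1 = Fraction(-176707, 12500)             # exact u1 from the text
lo, hi = interval(u1, Fraction(1, 100))   # hypothetical tolerance of 0.01

# The pair (lo, hi) then enters the constraint set as lo <= u1 <= hi.
```

Keeping the endpoints as exact rationals avoids rounding error in the bounds passed to the inequality reasoner.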

In  order  to  test  our  preliminary  implementation  of  the 
combined  method,  we  tried  several  simple  experiments  on  the 
model  and  image  described  above.  These  experiments  are  of 
the  following  form: 

1.  We  start  with  the  above  equations  describing  the  model 
as  well  as  the  equations  describing  the  affine  projection. 

2.  We  also  include  a  partial  specification  of  the  transformation
between  the  model  and  image.

3.  We  run  the  Grobner  basis  algorithm  on  this  set  of  equa¬ 
tions.  The  result  is  a  set  of  equations  relating  transfor¬ 
mation  and  model  parameters  to  image  coordinates. 

4.  We  next  add  inequality  constraints  specifying  the  range 
of  values  for  each  image  coordinate;  we  also  add  the  in¬ 
equality  constraining  the  model  parameters. 

5.  We  run  the  SUP-INF  algorithm  on  these  constraints, 
specifying  the  image  coordinate  variables  as  input  vari¬ 
ables. 

The  goal  in  each  case  is  to  find  whether  (i)  the  above  image 
is  indeed  an  image  of  an  instance  of  the  parameterized  model, 
and  (ii)  if  so,  to  find  tight  bounds  for  the  unknown  parameters. 

The  specific  experiments  follow: 

Experiment 1:

The transformation is completely specified by

cos ψ = .707, sin ψ = .707,
cos θ = .816, sin θ = -.577,
cos φ = .866, sin φ = .5,

and the affine scale factor is 1. Except for the constraint
on the area of the face, this gives a completely linear set of
inequalities. The SUP-INF algorithm quickly determined
the values of h and w as 10 within an error of ±0.5.

Experiment  2: 

Except for the affine scale factor, the transformation parameters
are known. The constraints in this case become
predominantly nonlinear. The SUP-INF algorithm successfully
determined tight bounds for h and w; in addition,
the scale was found to be 1 within an error of ±0.1.

Experiment  3: 

The rotation angle around the z-axis is unknown. The constraints
now become quite complicated nonlinear inequalities.
Our initial experiment failed to refine the
bounds for the unknown parameters, although the constraints
were more than sufficient to do so. We conjecture
that this reflects a deficiency in Brooks’ extension to the
SUP-INF method, but have not yet verified this.

In order to obtain some bounds, we represented the unit
circle, sin²(φ) + cos²(φ) = 1, by a piecewise linear approximation.
To compensate for this approximation, we
changed the previous equalities describing the relation
between the model and image coordinates into approximate
equalities, with an error equal to ±A. With this
approximation, the SUP-INF method succeeded as long
as A was sufficiently large.

Experiment  4: 

We left one parameter, scale, in the transformation unknown,
and dropped the image coordinates of one vertex
to simulate a missing vertex in the image. This experiment
is the same as Experiment 2, except that the values
for u4 and v4 are unknown.

When the parity of u4 and v4 is provided, the SUP-INF
method found the values for h and w within an error of
±2. Using these bounds, we were able to determine the
bounds for u4 and v4.

References 

Arnon,  D.S.,  Collins,  G.E.,  and  McCallum,  S.,  “Cylindrical 
Algebraic  Decomposition  I:  The  Basic  Algorithm,”  SIAM  J.  of 
Computing  13,  865-877,  1984. 

Arnon,  D.S.,  Collins,  G.E.,  and  McCallum,  S.,  “Cylindrical 
Algebraic  Decomposition  II:  An  Adjacency  Algorithm  for  the 
Plane,”  SIAM  J.  of  Computing  13,  878-889,  1984. 

Barry,  M.,  Cyrluk,  D.,  Kapur,  D.,  and  Mundy,  J.,  “A  Multi¬ 
level  Geometric  Reasoning  System  for  Vision,”  Proc.  of  an  NSF 
Workshop  on  Geometric  Reasoning,  Oxford,  England,  June 
1986.  To  appear  in  a  special  issue  of  the  Artificial  Intelligence 
Journal. 

Besl,  P.,  and  Jain,  R.,  “Three-Dimensional  Object  Recog¬ 
nition,”  Computing  Surveys  17  (1),  1985. 

Bledsoe, W.W., “A new method for proving certain Presburger
formulas,” Advance Papers, Fourth Intl. Joint Conf.
on Artificial Intelligence, Tbilisi, Russia, Sept. 1975, 15-21.

Brooks,  R.A.,  “Symbolic  Reasoning  Among  3D  Models  and 
2D  Images,”  Artificial  Intelligence  17,  1981. 

Buchberger,  B.,  An  algorithm  for  finding  a  basis  for  the 
residue  class  ring  of  a  zero-dimensional  polynomial  ideal,  (in 
German)  Ph.D.  Thesis,  Univ.  of  Innsbruck,  Austria,  1965. 

Buchberger, B., “A criterion for detecting unnecessary reductions
in the construction of Gröbner bases,” Proc. of EUROSAM 79,
Lecture Notes in Computer Science 72, Springer
Verlag, 3-21, 1979.




Buchberger, B., “Gröbner bases: An algorithmic method in
polynomial ideal theory,” in: N.K. Bose (ed.) Multidimensional
Systems Theory, Reidel, 184-232, 1985.

Connolly,  C.I.  and  Stenstrom,  J.R.,  “Construction  of  Poly¬ 
hedral  Models  from  Multiple  Range  Views,”  Proc.  8th  Inter¬ 
national  Conference  on  Pattern  Recognition,  1986. 

Cyrluk, D., Kapur, D., Mundy, J., and Nguyen, V., “Formation
of partial 3D models from 2D projections - an application
of algebraic reasoning,” 1987 DARPA Image Understanding
Workshop, Feb. 1987, Los Angeles, Calif.

Herman, M., Representation and Incremental Construction
of a Three-Dimensional Scene Model, Carnegie-Mellon University
Report, CMU-CS-85-103, 1985.

Kaufl, T., “Program verifier Tatzelwurm: Reasoning about
systems of linear inequalities,” Proc. of Eighth Intl. Conf. on
Automated Deduction, Oxford, England, July 1986.

Kapur,  D.,  Musser,  D.R.,  and  Narendran,  P.,  “Only  prime 
superpositions  need  be  considered  in  the  Knuth-Bendix  com¬ 
pletion  procedure,”  Unpublished  Manuscript,  General  Electric 
Corporate  Research  and  Development,  Schenectady,  New  York. 
To  appear  in  J.  of  Symbolic  Computation,  1985. 

Kapur, D., Sivakumar, G., and Zhang, H., “RRL: A
Rewrite Rule Laboratory,” Proc. of 8th Intl. Conf. on Automated
Deduction (CADE-8), Oxford, England, Lecture Notes
in Computer Science 230, Springer Verlag, 1986.

Kapur,  D.,  “Geometry  theorem  proving  using  Hilbert’s 
Nullstellensatz,”  Proc.  1986  Symposium  on  Symbolic  and  Alge¬ 
braic  Computation  (SYMSAC  86),  Waterloo,  Canada,  202-208, 
1986. 

Knuth,  D.,  and  Bendix,  P.,  “Simple  word  problems  in  uni¬ 
versal  algebras,”  in:  J.  Leech  (ed.)  Computational  Problems  in 
Abstract  Algebras,  Pergamon  Press,  1970. 

Sacks,  E.P.,  “Hierarchical  inequality  reasoning,”  Proc.  the 
National  Conf.  on  Artificial  Intelligence,  AAAI,  1987,  649-654. 

Shostak, R., “On the SUP-INF method for proving Presburger
formulas,” J. ACM 24 (4), Oct. 1977, 529-543.

Shostak,  R.,  “Deciding  linear  equalities  by  computing  loop 
residues,”  J.  ACM  28  (4),  Oct.  1981,  769-779. 

Strat, T., “Spatial Reasoning From Line Drawings of Polyhedra,”
Proc. of IEEE 1984 Workshop on Computer Vision
Representation and Control.

Sugihara,  K.,  “An  Algebraic  Approach  to  Shape  From  Im¬ 
age  Problems,”  Artificial  Intelligence  23,  1984. 

Thompson,  D.  and  Mundy,  J.,  “Three-Dimensional  Model 
Matching  From  an  Unconstrained  Viewpoint,”  Proc.  of  IEEE 
Conf.  on  Robotics  and  Automation ,  April  1987. 

van der Waerden, B.L., Modern Algebra, Vol. I and II,
Frederick Ungar Publishing Co., New York, 1966.




Test  Results  from  SRI’s  Stereo  System 


Marsha  Jo  Hannah 

Artificial  Intelligence  Center,  SRI  International 
333  Ravenswood  Avenue,  Menlo  Park,  CA  94025 


Abstract 

This paper briefly describes the automatic system, STEREOSYS,
which has been developed at SRI for stereo compilation.
This  system  uses  area-based  correlation,  but  applies  this  basic 
technique  in  a  variety  of  novel  ways  to  develop  a  disparity  model 
for  a  given  stereo  image  pair.  The  techniques  used  are  hierar¬ 
chical  in  nature,  and  incorporate  iterative  refinement,  as  well  as 
a  best-first  strategy,  in  the  matching  process.  This  paper  also 
presents  results  obtained  with  this  system  on  test  data  recently 
distributed by the International Society of Photogrammetry and
Remote Sensing, Working Group III/4 (Mathematical Aspects
of Pattern Recognition and Image Analysis) as part of their Test
A on Image Matching. ISPRS has not yet completed detailed
analysis of these results, but preliminary indications are that
STEREOSYS performed better than all other tested algorithms.


Introduction 

Over  the  past  several  years,  SRI  has  integrated  and  improved 
existing pieces of stereo software to form STEREOSYS, a baseline
system for automated, area-based stereo compilation [Hannah,
1985]. This system operates in several passes over the data,
during which it uses several different matching algorithms
iteratively to build and refine its model of a portion of the
three-dimensional (3-D) world represented by a pair of images. The
underlying  strategy  is  to  begin  with  a  few  points  that  are  most 
likely  to  be  matchable  (based  on  their  “interest”,  i.e.  information 
content);  these  are  matched  by  very  global,  but  very  conserva¬ 
tive,  search  algorithms.  Each  successive  algorithm  operates  on 
less  promising  points,  but  uses  more  information  from  matches 
made  at  previous  levels  to  constrain  the  search  to  smaller  and 
smaller  portions  of  the  epipolar  line,  until  eventually  all  interest¬ 
ing  points  have  been  processed. 

In November of 1980, we were invited to join an international
group of photogrammetrists and computer vision researchers
participating in a test of image matching capabilities. Each
participant received the same 12 pairs of digital images, to be
processed automatically, with the results returned to the
International Society of Photogrammetry and Remote Sensing's
Working Group III/4 (Mathematical Aspects of Pattern Recognition
and Image Analysis) for evaluation. Participants had the
options of producing matches on a specified regularly spaced grid
of points in the left image (Task A), of selecting left-image points
and matching them (Task B), or of performing other specified
measurement tasks that varied from one image pair to another.
This  paper  briefly  describes  STEREOSYS  and  presents  its  test 
results. 


Description  of  STEREOSYS 

The  matching  algorithms  in  STEREOSYS  begin  by  selecting 
a  set  of  well-scattered  windows  in  one  image,  such  that  each  win¬ 
dow  contains  sufficient  information  to  produce  a  reliable  match. 
To  accomplish  this,  a  statistical  operator  is  passed  over  the  image; 
this  operator  is  a  product  of  the  image  variance  and  the  mini¬ 
mum  of  ratios  of  directed  differences  (hence  edge  strength)  over 
windows  of  the  specified  size  [Hannah,  1980].  Local  peaks  in  the 
output of this operator (the “interesting” points) are recorded as
the  preferred  places  to  attempt  the  matching  process.  To  ensure 
that  the  selected  windows  are  well-scattered  in  the  image,  the 
image  is  divided  into  a  grid  of  subimages,  and  the  relative  ranks 
of  the  interesting  points  within  their  grid  cell  are  recorded;  this 
permits  the  most  interesting  points  in  each  area  to  be  matched 
first. 
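
The ranking of interesting points within grid cells can be sketched as follows. The sketch assumes the interest operator has already produced a score for each candidate point; the function name `rank_within_cells` and the tuple layout are hypothetical and are not taken from STEREOSYS.

```python
from collections import defaultdict

def rank_within_cells(points, cell_size):
    """points: list of (x, y, interest). Returns {(x, y): rank}, where
    rank 1 is the most interesting point in its grid cell."""
    cells = defaultdict(list)
    for x, y, score in points:
        # Integer division assigns each point to a square subimage.
        cells[(x // cell_size, y // cell_size)].append((score, x, y))
    ranks = {}
    for members in cells.values():
        members.sort(reverse=True)            # highest interest first
        for rank, (_, x, y) in enumerate(members, start=1):
            ranks[(x, y)] = rank
    return ranks
```

Selecting all points of rank 1, then rank 2, and so on gives a set of windows that is spread over the whole image rather than clustered in the most textured region.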

Whether or not point (x1, y1) in the first image I1 is matched
by point (x2, y2) in the second image I2 is determined by computing
the cross-correlation, normalized by both mean and variance,
over windows surrounding the points [Hannah, 1974]. The
matching point is taken to be the point in I2 with highest correlation,
as located by one of several search algorithms. All of
our matching algorithms use image hierarchies to some extent.
Pixels in each reduced image of the hierarchy are produced by
convolving the parent image with a Gaussian, then sub-sampling
[Burt, 1980].
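
For concreteness, a cross-correlation normalized by mean and variance can be computed as in the sketch below. It operates on two windows given as flat lists of gray levels; the handling of a zero-variance window (returning 0) is our assumption, not something specified for STEREOSYS.

```python
import math

def ncc(a, b):
    """Cross-correlation of two equal-sized windows (flat lists of gray
    levels), normalized by mean and variance; result lies in [-1, 1]."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("windows must be non-empty and the same size")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [p - mean_a for p in a]
    db = [q - mean_b for q in b]
    denom = math.sqrt(sum(p * p for p in da) * sum(q * q for q in db))
    if denom == 0:
        return 0.0      # a flat window carries no information to correlate
    return sum(p * q for p, q in zip(da, db)) / denom
```

Because of the normalization, the score is unchanged when one window differs from the other by a gain and an offset in brightness, which is why such a measure tolerates exposure differences between the two images.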

The  first  matching  algorithm,  unconstrained  hierarchical 
matching,  matches  each  specified  point  (usually  the  most  in¬ 
teresting  point  in  each  grid  cell)  using  an  unguided  hierarchical 
matching  technique  [Moravec,  1980].  This  technique  permits  the 
use  of  the  overall  image  structure  to  set  the  context  for  a  match; 
the  gradually  increasing  detail  in  the  imagery  is  then  followed 
down  through  the  hierarchy  to  the  final  match. 

In this matching technique, as in all the others we use, matches
must pass fairly strict tests in order to be considered correct.
At any level in the hierarchy, matches with poor correlation are
discarded, as are matches that fall outside of the image. Each
match must also be confirmed by back-matching; that is, if we
have found that point (x1, y1) in the first image I1 is best matched
by (x2, y2) in the second image I2, we then repeat the entire
matching algorithm, starting with (x2, y2) in I2 and searching
for the point (x1', y1') in I1 that best matches (x2, y2). If (x1, y1)
and (x1', y1') differ by more than one pixel, the entire match is
discarded as being unreliable.
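
The back-matching test can be sketched as below. The two search routines are passed in as hypothetical callables, and the "more than one pixel" criterion is interpreted here as a difference of more than `tol` in either coordinate; both are assumptions made for the illustration.

```python
def confirmed_by_back_match(p1, match_1_to_2, match_2_to_1, tol=1.0):
    """Return the match for p1 if back-matching confirms it, else None.

    match_1_to_2 and match_2_to_1 are hypothetical search routines that
    take a point (x, y) and return the best match (x, y) or None.
    """
    p2 = match_1_to_2(p1)
    if p2 is None:
        return None
    back = match_2_to_1(p2)
    if back is None:
        return None
    # Discard the match if the round trip lands more than tol pixels away.
    if max(abs(back[0] - p1[0]), abs(back[1] - p1[1])) > tol:
        return None
    return p2
```

The test doubles the cost of each match but rejects many matches that happen to correlate well in one direction only.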

If no a priori absolute camera model has been given with the
data set, we next calculate a simplistic relative camera model
from the set of point pairs produced by unconstrained hierarchical
matching. This is accomplished by searching for five angles
that best describe the relative positions and orientations of two
ideal pinhole cameras [Hannah, 1974].

The  next  technique  to  be  applied  is  epipolar  constrained  hier¬ 
archical  matching.  Having  determined  the  camera  parameters, 
we  now  know  the  manner  in  which  a  point  in  the  first  image 
projects  to  a  line  in  the  second  image — the  epipolar  constraint. 
This  constraint  allows  us  to  cut  the  search  from  two  dimensions 
(all  around  the  point)  to  one  dimension  (back  and  forth  along 
the  epipolar  line)  at  each  level  of  the  hierarchy.  In  all  other  re¬ 
spects,  epipolar  constrained  hierarchical  matching  proceeds  very 
much  like  unconstrained  hierarchical  matching,  and  is  used  on 
any  unmatched  points  amongst  the  two  most  interesting  points 
for  each  grid  cell. 

Once  a  good  basis  of  reliable  matches  has  been  found,  these 
matches  can  be  used  as  “anchor”  points  for  the  anchored  match¬ 
ing  technique.  Under  the  assumption  that  the  world  is  generally 
continuous,  a  point  would  be  expected  to  have  a  depth  similar  to 
that  of  its  neighbors.  Thus,  the  disparity  for  a  point  is  expected 
to  lie  in  the  interval  of  the  disparities  of  the  well-matched  points 
in  the  current  and  neighboring  cells  in  the  grid  of  subimages. 
This  disparity  interval  is  used  along  with  the  epipolar  constraint 
to  perform  a  very  local  search  for  the  match  to  a  point,  perhaps 
proceeding  one  or  two  levels  up  the  image  hierarchy,  to  provide 
context  for  the  match. 
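
A minimal sketch of the disparity interval used for anchored matching follows. It assumes each anchor contributes a single scalar disparity (for instance, displacement along the epipolar line) and that anchors are grouped by grid cell; the function name and data layout are hypothetical.

```python
def disparity_interval(cell, anchors):
    """cell: (col, row) of the grid cell containing the point.
    anchors: {(col, row): [disparity, ...]} for well-matched points.
    Returns (lowest, highest) disparity over the cell and its eight
    neighbors, or None if there are no anchors nearby."""
    col, row = cell
    nearby = [d
              for dc in (-1, 0, 1)
              for dr in (-1, 0, 1)
              for d in anchors.get((col + dc, row + dr), [])]
    return (min(nearby), max(nearby)) if nearby else None
```

The returned interval limits the search to a short segment of the epipolar line; a point with no anchors in its neighborhood would have to fall back on one of the less constrained techniques.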

Our  system  also  can  produce  matched  points  on  a  regularly 
spaced  grid,  if  desired.  This  matching  algorithm  also  uses  the 
anchored  match  technique,  searching  along  specified  portions  of 
the  epipolar  line,  to  calculate  matches  for  the  user-specified  grid 
of  points  in  the  first  image.  However,  holes  can  result  if  a  grid 
point  does  not  have  suitable  information  for  matching,  and  only 
matches  that  pass  all  the  tests  are  recorded. 

Results  on  Image  Matching  Test  A 

For  each  of  the  12  image  pairs  in  ISPRS’s  Test  A  on  Im¬ 
age  Matching,  we  attempted  to  perform  Standard  Task  B  — 
determination  of  the  parallaxes  at  selected  points— which  is  what 
STEREOSYS  does  best;  for  a  few  of  the  images,  we  also  at¬ 
tempted  to  determine  the  parallaxes  at  a  grid  of  points.  In  most 
cases,  these  240x240  images  were  run  with  the  standard  param¬ 
eters  for  the  system,  which  had  been  tuned  to  process  a  high- 
quality,  1024x1024  aerial  image  pair.  Because  of  incompatibili¬ 
ties  in  format,  we  did  not  use  the  camera  orientation  information 
given  with  each  test  image  pair,  or  any  other  a  priori  informa¬ 
tion;  we  used  the  raw  images,  without  transforming  to  normal 
images,  or  doing  any  other  resampling. 

For  the  most  part,  the  matching  proceeded  routinely,  using  the 
standard procedures and parameters. One parameter, a threshold
that indirectly controls the number of interesting points that
the system has to work with, was changed for each data set;
this was done to produce between 100 and 300 points per image,
as  requested  for  the  test.  In  what  follows,  we  outline  processing 
steps  that  deviated  from  the  usual  and  comment  on  the  overall 
results.  In  the  figures  that  illustrate  our  results,  matches  are 
shown with one of two different marks; the smaller marks correspond
to matches with correlations less than 0.3. The system
has  discarded  matches  that  it  was  not  confident  about,  but  we 
have  done  no  other  editing  of  its  final  results. 

On the images for Test 1 (named Car I, part of the engine
compartment of a car, sprayed with undercoating), the system
produced 335 interesting points, of which it matched 330 (Figure 1).
We also attempted to match a grid of points, which produced
507 matches (Figure 2). The interesting points are fairly
well  scattered  throughout  the  image,  providing  a  good  base  of 
matches  for  further  processing.  The  grid  of  points  is  fairly  com¬ 
plete.  From  a  brief  visual  inspection,  all  of  the  matches  appear 
to  be  reasonably  correct. 


Figure  1:  Test  1  (Car  I)-Parallaxes  at  selected  points. 


Figure 2: Test 1 (Car I)-Parallaxes at a grid of points.

On Test 2 (Quarry, an aerial view of a rock quarry), the system
produced 275 interesting points, of which it matched 249
(Figure  3).  We  also  attempted  to  match  a  grid  of  points,  which 
produced  437  matches  (Figure  4).  Again,  the  interesting  points 
are  fairly  well  scattered  throughout  the  image,  providing  a  good 
base  of  matches  for  further  processing,  although  there  are  a  few 
blank  spots.  The  grid  of  matches  has  more  holes  than  the  pre¬ 
vious  example,  but  is  still  fairly  complete.  All  of  the  matches 
appear  to  be  reasonably  correct.  The  foreground  post  was  found 
to  be  interesting,  but  those  matches  were  at  the  disparity  of  the 
background,  with  degraded  correlations. 


Figure 3: Test 2 (Quarry)-Parallaxes at selected points.

Figure 4: Test 2 (Quarry)-Parallaxes at a grid of points.

On Test 3 (Olympia I, an oblique view of part of the plexiglas
roof of the Olympia tent in Munich), the system produced 171
interesting points, of which it matched 139 (Figure 5). Because
of the sparse information in this image, we did not attempt to
produce matches on a grid. Our system had considerable trouble
with this image pair, because of the general lack of information
other than the ambiguities caused by the many identical buttons.
Performing unconstrained hierarchical matching on just the single
most interesting point in each grid cell did not produce enough
matches to form a camera model, so we started over, asking for
the four most interesting points per cell to be attempted. Of the
25 matches produced this way, one was clearly wrong; the camera
model solver was unable to converge to a model, and hence
was unable to edit out the point. In this instance (the only time
we did so for any test), we removed the offending point by hand,
then tried the camera model solution again, with good results.
(If our model solver included the RANSAC technique [Fischler
and Bolles, 1980], we believe it could have proceeded without
intervention.) To be consistent with the unconstrained matching,
the epipolar hierarchical matching was also done for any as-yet
unmatched points amongst the four most interesting points per
cell, rather than the usual two most interesting points. Doing an
anchored match with the normal parameters seemed to miss a lot
of points, so we asked for a second pass of anchored matching,
using all previous matches as anchors, doubling the number of
grid cells in each dimension, and constraining the search to just
the highest level of the image hierarchy. This produced a fairly
good base of matches, which appear to be reasonably correct.


Figure 5: Test 3 (Olympia I)-Parallaxes at selected points.


Figure 6: Test 4 (South America)-Parallaxes at selected points.


On  Test  4  (South  America — an  aerial  photo  of  an  area  without 
vegetation),  the  system  produced  309  interesting  points,  of  which 
it matched 169 (Figure 6). We also attempted to match a grid
of points, but this resulted in more “holes” than “grid”, so we
aborted the effort. The matched points are fairly well scattered
throughout the image. Because of the poor image quality, most
of the correlations were fairly low, which accounted for most of
the points that the system said could not be matched. The better
matches  that  were  retained  appear  to  be  mostly  correct,  although 
some  of  the  matches  appear  to  be  off  by  a  little.  Because  of  the 
fuzziness  of  the  images,  it  is  difficult  to  assess  the  correctness  of 
the  matches. 

On Test 5 (Bridge, part of a suspension bridge in the Grand
Canyon),  the  system  produced  219  interesting  points,  of  which  it 
matched  180  (Figure  7).  Based  on  our  difficulties  with  Test  3,  we 
began  by  trying  to  match  the  2  most  interesting  points  in  each 
cell  instead  of  just  the  single  most  interesting  point  per  cell,  as 
would  be  usual.  That  worked  quite  well  (i.e.  the  system  probably 
would  have  worked  correctly  with  its  normal  settings),  and  the 
system operated routinely from there. The matched points are
fairly well scattered along the “wall” of the bridge, and are strung
along  the  lines  on  the  “floor”  of  the  bridge.  The  matches  appear 
to  be  reasonably  correct,  although  a  couple  of  the  matches  appear 
to  be  off  by  one  wire  in  the  fence,  or  by  one  grid  element  in  the 
floor,  because  of  the  ambiguities  of  repetitive  patterns. 


Figure  7:  Test  5  (Bridge)-Parallaxes  at  selected  points. 


On  Test  6  (Tree — an  assortment  of  tree  branches  against  a 
background  of  sky),  the  system  produced  158  interesting  points, 
of  which  it  matched  109  (Figure  8).  The  interesting  points  are 
strung  out  along  the  various  branches,  marking  some  but  not  all 
of  the  branch  forks  and  crooks.  A  few  “interesting  points”  are  off 
to  the  side  of  a  branch,  rather  than  centered  on  it;  this  is  a  known 
weakness  of  our  interest  operator.  Despite  the  unusual  nature 
of  the  scene,  it  was  processed  with  the  usual  parameters  and 
sequencing  of  routines.  Most  of  the  matches  are  locally  plausible, 
although  some  of  them  have  tracked  “false  intersections” — places 
where one branch passes in front of another, rather than an actual
real-world point.


Figure 8: Test 6 (Tree)-Parallaxes at selected points.


On  Test  7  (Island— an  aerial  view  of  a  rocky  island  off  the 
coast  of  Sweden),  the  system  produced  311  interesting  points, 
of which it matched 260 (Figure 9). The interesting points are
fairly  well  scattered  throughout  the  image,  providing  a  good  base 
of  matches  for  further  processing.  Most  of  the  matches  appear 
to  be  reasonably  correct,  although  there  are  a  few  questionable 
ones  near  the  edges  of  the  image. 

On Test 8 (Switzerland: an aerial view of a smooth hill with
scattered  trees,  taken  under  different  lighting  conditions),  the 
system  produced  279  interesting  points,  of  which  it  matched 


Figure 9: Test 7 (Island)-Parallaxes at selected points.

ra6  (Figure  10).  The  interesting  points  are  fairly  well  scat¬ 
tered  throughout  the  image.  The  matches  appear  to  be  substan¬ 
tially  correct,  although  they  represent  a  mixture  of  tree  tops  and 
ground  points,  since  our  system  is  unable  to  distinguish  between 
the  two.  Many  of  the  correlations  are  low,  possibly  because  the 
two  images  appear  to  have  been  recorded  at  significantly  different 
times  of  day. 


Figure  10:  Test  8  (Switzerland)-Parallaxes  at  selected  points. 

On  Test  9  (Car  II — another  view  of  the  car’s  engine  compart¬ 
ment),  the  system  produced  271  interesting  points,  of  which  it 
matched 168, using the usual processing routine and parameters,
despite the wide range of disparities (Figure 11). The matched
points are concentrated in a few clusters on high-intensity areas
of the image. It is difficult to fuse these images long enough to check
the  results,  but  most  of  the  matches  appear  to  be  approximately 
correct,  with  the  possible  exception  of  a  few  matches  near  the 
headlight. 


Figure 11: Test 9 (Car II)-Parallaxes at selected points.

On Test 10 (Wall: an oblique view of the working face in a
stone quarry), the system produced 224 interesting points, of
which it matched 191 (Figure 12). The matched points are fairly
well scattered throughout the image, avoiding blank spots and
the very linear edges between faces. The matches appear to be
reasonably correct, with a few possible problems near the edge
of the image.

Figure 12: Test 10 (Wall)-Parallaxes at selected points.

On Test 11 (Olympia II: another view of the Olympia tent),
the system produced 197 interesting points (Figure 13). After
the difficulty we had with Test 3 (Olympia I), we tried matching
the four most interesting points in each cell. Of these 133 points,
only 4 resulted in matches that our system thought were reliable,
based  on  its  local  criteria,  and  none  of  these  matches  were  actu¬ 
ally  correct.  We  therefore  abandoned  processing  on  this  image 
pair.  We  believe  that  the  combination  of  repetitive  structures, 
the  rather  different  points  of  view,  and  the  transparency  of  the 
dome  caused  our  hierarchical  matching  techniques  to  fail.  If  we 
had  elected  to  use  the  accompanying  camera  information,  we 
might have been able to do some matching, although the transparency
would undoubtedly have led to numerous problems.


Figure  13:  Test  11  (Olympia  II)-Unable  to  determine  parallaxes. 

On  Test  12  (House — an  aerial  view  of  a  house  and  fence  on 
rolling  terrain),  the  system  produced  119  interesting  points,  of 
which  it  matched  92  (Figure  14).  The  interesting  points  are 
poorly  distributed  throughout  the  image,  avoiding  the  feature¬ 
less  areas  of  the  driveway,  lawn,  and  roof,  and  the  linear  edges  of 
the  shadow.  The  matches  appear  to  be  reasonably  correct,  even 
in the difficult area where the back of the house falls off to the

Figure 14: Test 12 (House)-Parallaxes at selected points.


Summary 

In this paper, we have described STEREOSYS, our automatic
system for stereo compilation, which uses area-based correlation,
but applies this basic technique in a variety of novel ways. Our
techniques are hierarchical in nature, and use iterative refinement
with a best-first strategy in the matching process, as well as
the constraint of backmatching to verify matches. We have also
presented the results of our system on the Image Matching Test A
data set recently distributed by ISPRS’s Working Group III/4.
Our results on ISPRS’s test imagery have again demonstrated
the robustness of the correlation-based matching technique, and
its  wide  range  of  applicability  over  image  types. 

Overall,  we  are  very  pleased  with  the  results  of  our  system. 
For  10  of  the  12  tests,  we  were  able  to  handle  the  image  pairs 
without  substantially  altering  the  default  parameters  or  process¬ 
ing  sequence.  Our  only  problems  came  on  the  images  of  the 
Olympia  dome;  this  is  not  surprising,  since  our  algorithms  were 
designed  for  use  on  highly  textured  natural  terrain,  not  the  bland 
faces  and  transparency  of  cultural  objects.  For  the  most  part, 
our  results  appear  to  be  reasonably  correct,  even  in  the  face  of 
large  disparity  ranges  within  small  areas  of  the  image.  As  of  this 
writing, we have not received the results of the committee’s detailed
analysis; however, we have been told that our algorithm
was  able  to  process  a  larger  number  of  the  images  than  any  other 
participant.

ABSTRACT

The  objective  of  the  PPL  project  is  to  design  and 
implement  a  general  and  modular  logic-programmed  sys¬ 
tem  for  two-dimensional  interpretation  of  image  theories 
in  image  structures  obtained  by  image  analysis.  Impor¬ 
tant  subsystems  include  heuristic  search  for  object 
instances  with  optimization  of  goodness-of-figure,  and 
procedures  for  computing  basic  image  components, 
locales  for  searches,  and  predicates.  We  illustrate  some 
of  these  in  an  application  to  aerial  images  of  suburban 
neighborhoods. 


1.  INTRODUCTION 

Image  analysis  and  interpretation  are  the  principal 
objectives  of  computer  vision.  This  report  describes  a 
modular  logic-programming  system,  called  Programmed 
Picture  Logic  (PPL),  for  two-dimensional  interpretation 
of  images,  and  illustrates  its  application  to  aerial  photos 
of  suburban  neighborhoods. 

The  goal  of  PPL  is  to  specify  and  implement  a 
“constructive  model  theory”  for  logical  theories  of 
images.  The  approach  is  general — essentially  logic- 
programmed  “generate  and  test” — which  combines  sym¬ 
bolic  and  numerical  programs  of  Prolog,  Lisp,  and  C  for 
logical representation and inference, hierarchical segmentation
and search, procedural evaluation of image predicates,
and optimization of figures. The emphasis in this
research has been on careful, general, even simple design
rather than upon specialized techniques; this may be
more  effective  with  the  advances  in  supercomputing. 

Important related research in the same area of
expert interpretation of aerial photos is by McKeown
(1985), who has developed a rule-based system for image
interpretation using regional segmentation and search,
which is less complete than that of PPL. Selfridge
(1982) uses adaptive thresholding for image resegmentation,
while PPL uses type-specific, multi-level connected
component analysis. Rule-based systems for uninterpreted
image segmentation have been developed by Nazif
and Levine (1984); however, there are very few large
applications of logic programming to expert computer
vision. Recently, Reiter and Mackworth (1987) have discussed
application of formal logic in a theory of depiction
and interpretation, although without implementation
by programs. [Maier and Warren (1988) is an excellent
introduction to logic programming generally,
besides Prolog; Bratko (1986) focuses on AI algorithms;
van Caneghem and Warren (1986) gives other applications
of logic programming.]

2.  OVERVIEW  OF  PPL 

The  current  version  of  PPL  is  a  mixed-language  sys¬ 
tem  which  uses  Prolog  for  transparent  definition  of 
semantic  types  and  relations  of  objects  and  for  control  of 
image  analysis  and  interpretation,  and  which  efficiently 
evaluates  image  predicates  by  calls  to  Lisp  or  C  func¬ 
tions. 

The  goal  of  the  PPL  system  is  to  construct  a  largest 
consistent  interpretation  or  model  of  the  Prolog  “image 
theory”,  subject  to  constraints  on  computational  com¬ 
plexity,  by  searching  for  instantiations  within  certain 
combinatorial  topologies  obtained  by  four  kinds  of  com¬ 
binatorial  operations  on  connected  components.  The 
search  attempts  both  to  find  instances  and  to  delineate 
them  accurately.  The  complexity  of  search  is  reduced  by 
a  number  of  means  which  will  be  discussed  later. 

The  “philosophy”  of  PPL  is  essentially  one  of 
sophisticated,  logic-programmed  generate-and-test.  The 
block  diagram  of  the  system,  shown  in  Figure  1,  will  be 
useful  in  our  discussion  of  what  this  means. 

2.1.  Preprocessing  and  Segmentation 

The  input  image  is  first  enhanced  before  segmenta¬ 
tion.  This  involves  recursive  application  of  a  Symmetric 
Neighborhood  Filter  (SNF)  and  optimization  of  the 
resulting  near  fixed-point  image  with  respect  to  the  origi¬ 
nal  image  until  a  more  accurate  enhanced  image  is 
obtained  (Figure  2).  Currently,  this  same  enhanced 
image  is  used  throughout  resegmentation,  so  the  result  of 
interpretation  depends  very  much  on  the  filter's  quality. 
We  present  results  of  enhancing  poor  aerial  images  of 
suburban  neighborhoods.  Also,  Figures  3a-b  show  the 
results  of  enhancing  a  complete  aerial  photograph  with 
very  fine  structure  which  is  obliterated  by  previously 
developed  smoothing  operations. 


Figure  2.  Preprocessing. 

The  enhanced  image  is  segmented  by  a  simple 
gray-level  connected  component  algorithm  into  a 
hierarchical  set  of  segmentations  consisting  of  increas¬ 
ingly  finer  connected  components  (Figure  4).  The  finest 
level  of  segmentation  is  called  the  atomic  segmentation 
and  the  components  at  that  level  are  called  the  atomic 
components.  The  components  at  coarser  levels  of  seg¬ 
mentation  are  represented  as  compositions  of  these.  [We 
admit  the  possibility  that  atomic  regions  may  sometimes 
have  to  be  split  by  geometrical  region-splitting  opera¬ 
tions.  We  can  also  regard  the  connected  component 
algorithm  as  a  kind  of  splitting  of  larger  components 
according  to  gray-level  contrast.]  We  maintain  a  unique 
symbolic  representation  of  every  component  at  every 
level  of  segmentation,  which  also  allows  efficient  sym¬ 
bolic  implementations  of  some  of  the  combinatorial  and 
set  operations  on  regions,  e.g.,  intersection.  In  the 
current  implementation  the  segmentations  are  generated 
before  the  interpretation  starts. 
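The hierarchy of segmentations described above can be illustrated with a minimal Python sketch. It is not the PPL implementation (which is in Lisp); the merge rule used here (4-connected neighbors whose gray levels differ by at most a threshold) and the names `label_components` and `segmentation_hierarchy` are assumptions made for illustration. With this rule, lowering the threshold can only split components, so each coarser component is a union of finer ones, and the finest level plays the role of the atomic segmentation.

```python
from collections import deque

def label_components(image, max_diff):
    """Label 4-connected components; neighbors join if gray levels differ by <= max_diff."""
    rows, cols = len(image), len(image[0])
    labels = [[-1] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != -1:
                continue
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and labels[ny][nx] == -1
                            and abs(image[ny][nx] - image[y][x]) <= max_diff):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return labels, count

def segmentation_hierarchy(image, thresholds):
    """Return (labels, count) pairs ordered from coarsest to finest segmentation."""
    return [label_components(image, t) for t in sorted(thresholds, reverse=True)]
```

A coarse-to-fine search would visit the returned list in order, falling back to the next finer labeling only when the coarser one yields no instance.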

Figure 3. (a) Original image, (b) SNF, fixed point.

Next, labeled adjacency graphs of segmentations are
constructed for use during search. Some of the properties
of the connected components frequently used by the
search process are also computed: mean gray level, area,
and  contrast  at  boundaries.  These  graphs  will  also  be 
used  in  an  analysis  of  whether  a  goal  is  satisfiable  at  all, 
and  to  determine  whether  a  complete  search  is  feasible. 
If  it  is  not,  then  a  more  constrained,  heuristic  search  has 
to  be  applied.  Bit-maps  of  components  are  computed 
only  at  the  atomic  segmentation  level;  the  bit-maps  at 
coarser  levels  are  computed  only  when  necessary.  Seg¬ 
mentation  and  attribution  of  components  is  implemented 
in  Lisp  on  a  Symbolics  computer,  thus  allowing  an 
efficient  implementation  of  heuristic  graph  search. 

Figure 4. Segmentation for search.

2.2. Image Theory

The image theory can be considered as the main
knowledge base of the system, and consists of a set of
logical formulae representing a Prolog program. It can
be divided into two parts: a semantic theory and an auxiliary
geometric theory. The semantic theory specifies
the semantic types and relations between the important
objects of the image domain. For example, in the
domain of suburban neighborhoods the
definitions range from simple ones such as generic
houses, roads and trees to complex ones such as houses
of  different  shapes  (L-shaped  houses,  houses  with  chim¬ 
neys,  houses  with  patios,  houses  with  different  roofing), 
vehicles,  or  trees  in  the  front  yard  of  a  house.  [The  auxi¬ 
liary  geometric  theory  consists  of  rules  about  mathemat¬ 
ical  operations  on  arbitrary  regions  or  boundaries  in  the 
image:  these  might  include,  for  example,  simple  rules  for 
set,  connectivity,  angular  or  metrical  relations.]  The  fol¬ 
lowing  types  of  objects  are  defined  in  the  image  theory 
of  the  current  version  of  PPL:  houses,  house-adjuncts, 
shadows  of  houses  and  adjuncts,  road  complexes,  trees, 
and  vehicles  on  roads  or  driveways.  The  use  of  the  logi¬ 
cal  representation  of  Prolog  makes  the  expansion  of  the 
image  theory  quite  straightforward. 


2.3.  Meta-Theory 

Meta-theory  is  the  theory  about  finding  a  logically 
consistent  interpretation  of  the  image  theory  within  vari¬ 
ous  analyzed  image  structures;  it  is  implemented  as  a 
meta-interpreter  of  the  image  theory  in  Prolog.  We  may 
think  of  this  as  a  program  for  a  constructive  model 
theory — including  rules  which  control  the  construction 
by the Prolog interpreter of a model (or set of solutions)
of  a  given  set  of  axioms  for  images  of  objects.  It 
specifies  the  flow  of  control  between  the  different 
modules  of  the  system  in  terms  of  Prolog  rules  for 
different  types  of  objects  and  also  provides  a  simple  user 
interface for specification of top-level system goals,
queries and certain parameters associated with appearances
of objects (e.g., 3D sun direction, scale, gray
values).  It  makes  use  of  the  inference  mechanism  of  Pro¬ 
log  to  build  a  largest  consistent  interpretation  of  the 
image  theory  with  respect  to  the  image  (i.e.,  each  image 
of  application). 

Meta-theory,  in  addition,  specifies  relevant  parame¬ 
ters  and  rules  which  constrain  search  for  specific  types  of 
objects — what  are  their  locations,  simple  properties,  and 
expected  levels  of  segmentation.  In  the  current  version 
of  PPL  these  are  predetermined  by  prior  experiments 
and  encoded  into  the  image  theory  as  Prolog  facts.  (It 
should  be  noted,  however,  that  we  plan  to  investigate 
interactive  parameter  specification  by  sampling 
instances,  using  the  same  procedures  of  PPL  in  inverse 
mode.  With  this,  we  may  identify  some  instances  and 
have  the  system  find  others  like  them.) 

To  illustrate  the  actions  of  meta-theory  (and  hence 
the  procedure  of  interpretation)  let  us  consider  a  simple 
goal: 

Prove  that  there  exists  an  X  such  that  X  is  a  house. 
This  query  can  be  represented  in  Prolog  as: 

house(X).

The  meta-theory  first  tries  to  satisfy  this  goal  by  instan¬ 
tiating  the  variable  X  to  the  first  instance  of  house  that 
can  be  deduced  from  the  current  state  of  interpretation. 
The  current  state  of  interpretation  can  be  obtained  from 
the  Prolog  database  which  contains  facts  about  that  part 
of  the  image  which  has  been  interpreted  so  far.  (Actu¬ 
ally,  the  interpretation  consists  of  these  facts,  plus  all 
those  which  can  be  deduced  from  these  using  the  image 
and  geometric  theories.)  There  are  two  conditions  under 
which  this  goal  may  fail  to  be  satisfied:  the  store  of  facts 
is  incomplete,  so  no  instance  may  be  deduced,  or  there 
simply  aren’t  any  instances  satisfying  the  definition  of  a 
house.  In  the  latter  case  the  system  returns  that  the 
goal  could  not  be  satisfied.  In  the  former  case,  which  is 
more  interesting,  the  meta-theory  initiates  a  search  for 
all  possible  instances.  The  next  sub-section  describes 
this  search  process  in  detail. 
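The control flow just described (consult the recorded facts first, and search the image only when the store of facts is incomplete) can be sketched in Python as follows. This is an illustrative assumption, not the PPL meta-interpreter, which is written in Prolog; `database`, `search`, and `is_instance` are hypothetical stand-ins for the Prolog database, the search module, and the full object definition.

```python
def satisfy(goal_type, database, search, is_instance):
    """Return instances of goal_type, using recorded facts before searching.

    database    -- dict mapping an object type to its recorded instances
    search      -- callable returning candidate regions under relaxed constraints
    is_instance -- callable applying the full object definition to a candidate
    """
    if goal_type in database:
        # Facts about this type are already part of the interpretation.
        return list(database[goal_type])
    # The store of facts is incomplete: search for all possible instances.
    candidates = search(goal_type)
    found = [c for c in candidates if is_instance(goal_type, c)]
    # Record the result so later goals need not repeat the search.
    database[goal_type] = found
    return found  # an empty list means the goal could not be satisfied
```

Recording an empty result distinguishes the two failure conditions: a type that has been searched and has no instances is not searched again.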

2.3.1.  Search 

The  search  module  searches  through  the  hierarchy 
of  segmentations  under  a  set  of  relaxed  constraints  on 
the  properties  of  the  object  to  find  regions  which  are 
most  likely  to  be  instances  of  the  object  using  a  combi¬ 
nation  of  merging,  splitting,  union  and  intersection. 
(There  are  restrictions  on  the  latter  three  operations, 
which  are  necessary  to  avoid  intractable  complexity.)  A 
block  diagram  of  this  search  is  shown  in  Figure  5.  The 
parameters  given  to  the  search  module  are  specific  to  the 
sought-for  object  type.  They  consist  of  the  bounds  on 
the  segmentation  levels  to  be  searched,  a  set  of  region 
properties  and  a  locale  for  search.  The  region  properties 
determine  the  merging  criteria  and  the  terminating  con¬ 
ditions  for  search  and  are  determined  by  the  meta-theory 
as  mentioned  in  the  last  section.  The  properties  used  in 
the  current  version  are  expected  gray  level,  expected 


area,  tolerance  on  area  and  gray  level,  and  threshold  on 
the  contrast  between  adjacent  components.  The  search 
is  conducted  level  by  level  beginning  from  the  coarsest 
level  of  segmentation.  The  search  space  is  reduced  by 
restricting  the  search  to  a  subgraph  of  the  basic  con¬ 
nected  component  graph  based  on  the  search  locale. 
First,  those  regions  with  average  gray  levels  between  the 
upper  and  lower  bounds  of  the  expected  gray  level  are 
selected  to  be  the  roots  for  the  search  process.  Starting 
from  a  root,  larger  and  larger  regions  are  generated  by 
merging  adjacent  components  (sorted  in  increasing  order 
of  contrast  along  their  boundaries)  that  satisfy  the  merg¬ 
ing  criteria,  and  at  the  same  time  checking  for  search 
terminating  conditions.  The  merging  criteria  used  in  the 
current  implementation  are: 

(1)  The  average  contrast  along  the  boundary  between 
the  two  regions  is  less  than  the  contrast  thresh¬ 
old. 

(2) The average gray level of the resulting region is within
tolerance of the expected gray level.

(3)  The  total  area  of  the  resulting  region  is  within 
tolerance  values. 

This  cycle  of  merging  and  testing  continues  until  a 
region  whose  area  is  at  least  equal  to  the  expected  area 
and  does  not  exceed  the  upper  bound  on  area  is  obtained 
(successful  search)  or  there  are  no  more  regions  that  can 
be  merged  (unsuccessful  search).  A  search  tree  is  gen¬ 
erated  starting  from  each  root.  If  the  search  rooted  at 
each  of  these  roots  fails  the  search  is  continued  at  the 
next  finer  level,  with  a  new  set  of  roots  at  that  level.  At 
the  end  of  the  search  the  search  module  returns  a  list  of 
leaves  of  the  successful  search  trees.  The  search  module 
is  efficiently  implemented  in  Lisp  on  the  Symbolics;  the 
data  structure  used  is  a  spanning  tree,  used  to  avoid 
redundant  evaluations. 
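The merge-and-test cycle above can be sketched in Python. This is a deliberately simplified, greedy single-path version under stated assumptions: the actual module builds a search tree from each root and is implemented in Lisp, whereas here each step merges only the lowest-contrast admissible neighbor. The data layout (`comps`, `adj`) and the name `grow_region` are hypothetical, and the contrast between the growing region and a neighbor is approximated by the contrast stored for a single pair of adjacent components.

```python
def grow_region(root, comps, adj, expected_gray, gray_tol,
                expected_area, max_area, contrast_thresh):
    """Grow a region from root by merging; return its component ids or None.

    comps -- dict: component id -> {"gray": mean gray level, "area": pixel count}
    adj   -- dict: component id -> {neighbor id: boundary contrast}
    """
    region = {root}
    area = comps[root]["area"]
    gray_sum = comps[root]["gray"] * area
    while area < expected_area:
        # Neighbors of the current region, lowest boundary contrast first.
        frontier = sorted((contrast, n)
                          for m in region
                          for n, contrast in adj[m].items()
                          if n not in region)
        for contrast, n in frontier:
            new_area = area + comps[n]["area"]
            new_sum = gray_sum + comps[n]["gray"] * comps[n]["area"]
            if (contrast < contrast_thresh                              # criterion (1)
                    and abs(new_sum / new_area - expected_gray) <= gray_tol  # (2)
                    and new_area <= max_area):                          # criterion (3)
                region.add(n)
                area, gray_sum = new_area, new_sum
                break
        else:
            return None  # no more regions can be merged: unsuccessful search
    return region if area <= max_area else None
```

Roots would be chosen as in the text, from components whose mean gray level lies within the expected bounds, and the same routine would be retried at the next finer level when every root fails.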

Due  to  the  exhaustiveness  of  the  search  only  those 
properties  of  regions  which  are  relatively  inexpensive  to 
compute  are  used  in  the  merging  and  terminating  cri¬ 
teria.  Hence  the  regions  returned  by  the  search  module 
may  not  satisfy  the  definition  of  the  object  completely, 
but  rather  only  a  relaxed  definition  of  the  object  based 
on  gray  level,  area,  and  contrast,  besides  the  important 
constraint  on  location  with  respect  to  other  objects. 
Each  of  these  regions  is  admissible  for  evaluation  with 
respect  to  the  full  definition  of  the  object,  involving  pos¬ 
sibly  any  sort  of  image  analysis.  For  example,  compact¬ 
ness  and  rectilinearity  are  checked  for  house  instances. 
Several  Lisp  functions  are  provided  in  the  library  of 
image  predicates  for  computing  some  of  the  properties  of 
regions  used  during  this  evaluation.  The  image  analyses 
implemented  in  the  current  version  are  boundary  extrac¬ 
tion,  compactness  and  rectilinearity.  However,  many 
more  will  have  to  be  added  as  the  objects  in  the  image 
domain  become  more  and  more  complex.  All  the  admis¬ 
sible  instances  and  any  of  their  properties  computed  so 
far  are  recorded  in  the  Prolog  database  as  facts,  so  that 
they  can  be  used  by  the  meta-theory  without  having  to 
recompute them if needed during further inferences.

Figure 5. Search by merging.

2.3.2. Optimization

The  complete  search  described  above  has  the  advan¬ 
tage  that  it  generates  every  instance  that  could  possibly 
be  generated.  However,  it  may  also  generate  many  spa¬ 
tially  overlapping  candidate  instances  of  the  same  type 
of  object.  It  is  the  function  of  the  optimization  process 
to  resolve  this  conflict  by  selecting  the  one  with  the  best 
goodness-of-figure among these overlapping instances.
To  check  if  a  candidate  instance  Y  is  optimal  with 
respect  to  a  particular  type  of  object  it  is  compared  with 
every  other  candidate  instance  of  the  object  in  the  data¬ 
base  (note  that  all  instances  are  recorded  in  the  Prolog 
database).  If  a  candidate  instance  X  overlaps  with  Y 
then  their  goodness-of-figure  values  are  compared.  Inter¬ 
section  of  sets  of  atomic  components  forming  a  region 
(atomic  representation)  is  used  to  check  for  overlapping 
regions  (bit-maps  of  the  regions  can  also  be  used).  We 
have used the average absolute value of figure-ground
contrast as a measure of the goodness-of-figure (besides the
usual definitional constraints on shape, for example, simple
connectedness, i.e., no holes, or sufficiently straight
edges). If the goodness-of-figure of X is better than that
of Y, then all facts about Y are deleted from the
database and X replaces Y as the current best instance.
Otherwise, all facts about X are deleted from the database.
This is continued until all candidate instances of
the object type have been examined, at which point the surviving
instance is taken to be the one having best figure.
Figure-ground  contrast  has  been  found  to  be  a  reason¬ 
able  measure  of  goodness-of-figure  for  most  types  of 
objects  in  our  image  domain. 
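The conflict resolution described in this section can be sketched in Python. Note the assumption: rather than the pairwise compare-and-delete loop over the Prolog database, this sketch uses an equivalent-in-spirit greedy formulation, visiting candidates from best to worst goodness-of-figure and keeping one only if it shares no atomic component with a candidate already kept. The field names `atoms` and `goodness` and the function name `select_optimal` are hypothetical.

```python
def select_optimal(candidates):
    """Keep, among spatially overlapping candidates, those with the best figure.

    Each candidate is a dict with:
      "atoms"    -- frozenset of atomic component ids (the atomic representation)
      "goodness" -- goodness-of-figure, e.g. average absolute figure-ground contrast
    """
    kept = []
    for cand in sorted(candidates, key=lambda c: c["goodness"], reverse=True):
        # Two regions overlap when their sets of atomic components intersect.
        if all(cand["atoms"].isdisjoint(k["atoms"]) for k in kept):
            kept.append(cand)
    return kept
```

Set intersection on atomic component ids stands in for the bit-map comparison, which is why a unique symbolic representation of every component is useful here.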

3.  IMPLEMENTATION  AND  EXPERIMENTAL 
RESULTS 

3.1.  Implementation 

The  PPL  system  is  a  multi-language  system  (C, 
Lisp  and  Prolog).  Except  for  preprocessing,  which  is 
implemented  on  a  VAX  in  C,  the  rest  of  the  system  is 
implemented  on  a  Symbolics  in  Lisp.  Equivalent 
representational  primitives  have  been  developed  using 
Flavors.  An  image  is  represented  as  an  instance  of  a 
flavor  called  PPL-image.  The  software  required  for  I/O 
is  implemented  in  Lisp.  The  segmentation  and  search 
modules  are  implemented  in  Lisp  using  Flavors.  A  basic 
flavor  called  Component  is  defined  to  represent  nodes  in 
the  search  tree  and  the  graphs.  Another  flavor  called 
Node  is  defined  on  top  of  this  to  represent  instances  of 
objects.  The  rest  of  the  system,  including  the  meta¬ 
theory,  is  implemented  in  Prolog.  Two  dialects  of  Prolog 
are  available  in  the  software  environment  of  the  Symbol¬ 
ics:  DEC-10  Prolog  and  Lisp  Prolog.  The  former  was 
chosen  for  portability  reasons. 

As  is  evident  from  the  discussion  earlier,  we  use 
Prolog's  backward-chaining  capability  for  handling  the 
flow  of  control  between  the  different  modules  of  the  sys¬ 
tem.  To  illustrate  this,  let  us  consider  a  typical  example. 
A  Prolog  rule  for  detecting  shadows  of  simple  houses 
might  be: 


house_shadow(HS) :- house(H),
                    house_shadow_locale(H,L),
                    candidate_in_locale(L,HS),
                    defn_shadow(HS),
                    optimal_figure(HS).

This  rule  states  that  HS  is  a  shadow  if  there  exists  a 
house  H  whose  shadow  locale  is  L  and  HS  is  a  good  sha¬ 
dow  lying  within  locale  L  and  is  an  optimal  instance  of  a 
shadow.  The  sub-goal  defn_shadow(HS)  will  itself  be 
another  rule  which  defines  a  house  shadow  based  on  its 
primary  features.  When  the  meta-interpreter  is  asked  to 
satisfy  the  goal  house_shadow(HS),  Prolog  has  to  satisfy 
each  of  the  sub-goals  in  the  body  of  the  rule  in  order  to 
satisfy  this  goal.  The  Prolog  interpreter  tries  each  of  the 
sub-goals  from  left  to  right  and  returns  the  first  instance 
satisfying  the  goal.  Additional  instances  can  be  obtained 
by  asking  the  system  to  resatisfy  the  goal.  Note  that  if 
there  is  no  house  in  the  database  the  system  will  try  to 
deduce  an  instance  of  a  house  using  the  house  rule.  In 
order  to  extend  the  system  to  include  other  objects,  more 
definitions  can  be  added  to  the  Prolog  image  theory 
easily  without  worrying  too  much  about  the  underlying 
control  structure,  although  we  may  have  to  add  pro¬ 
grams  for  new  image  predicates. 
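For readers less familiar with Prolog, the left-to-right evaluation of the house_shadow rule, including resatisfaction, can be mimicked with a Python generator. This is only an analogy under stated assumptions: the five callables are hypothetical stand-ins for the sub-goals of the rule, and Python's nested loops reproduce backtracking only for this fixed rule shape.

```python
def house_shadow(houses, shadow_locale, candidates_in, defn_shadow, optimal_figure):
    """Lazily yield (house, shadow) pairs, trying sub-goals from left to right."""
    for h in houses():                      # house(H)
        locale = shadow_locale(h)           # house_shadow_locale(H, L)
        for hs in candidates_in(locale):    # candidate_in_locale(L, HS)
            # defn_shadow(HS), optimal_figure(HS): on failure, fall through to
            # the next candidate, as Prolog would on backtracking.
            if defn_shadow(hs) and optimal_figure(hs):
                yield h, hs
```

Calling `next()` on the generator once corresponds to returning the first instance satisfying the goal; calling it again corresponds to asking the system to resatisfy the goal.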

3.2.  Object  Definitions 

The  following  are  definitions  of  the  objects  which 
are  searched  for  by  the  present  implementation  of  PPL. 

A  house  is  a  connected  region  of  the  image  satisfy¬ 
ing  constraints  on  area,  compactness,  average  gray  value, 
boundary  contrast,  and  rectilinearity.  (All  instances  of 
all  types  must  have  best  figure-ground  contrast.)  A  house 
is  rectilinear  if  a  large  proportion  of  its  boundary 
tangents  are  parallel  or  normal  to  one  direction;  we  also 
check  the  ratio  of  frequencies  of  tangents  in  the  major 
(parallel)  and  minor  (normal)  directions. 
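One way such a rectilinearity test might be computed is sketched below in Python. The paper does not give its formula, so everything here is an assumption: tangent directions are taken as angles in degrees, the major direction is estimated as the center of the fullest bin of a histogram over undirected angles, and the tolerance and bin width are arbitrary illustrative values.

```python
def angular_distance(a, b):
    """Smallest difference between two undirected line angles, in degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

def rectilinearity(tangent_angles_deg, tol_deg=10.0, bin_deg=5.0):
    """Return (fraction of tangents near the major or minor direction,
    ratio of minor-direction to major-direction tangent counts)."""
    if not tangent_angles_deg:
        raise ValueError("need at least one boundary tangent")
    n_bins = int(round(180.0 / bin_deg))
    hist = [0] * n_bins
    for a in tangent_angles_deg:
        hist[int((a % 180.0) // bin_deg) % n_bins] += 1
    # Major direction: center of the most populated bin; minor is normal to it.
    major = (hist.index(max(hist)) + 0.5) * bin_deg
    minor = (major + 90.0) % 180.0
    n_major = sum(angular_distance(a, major) <= tol_deg for a in tangent_angles_deg)
    n_minor = sum(angular_distance(a, minor) <= tol_deg for a in tangent_angles_deg)
    fraction = (n_major + n_minor) / len(tangent_angles_deg)
    ratio = n_minor / n_major if n_major else 0.0
    return fraction, ratio
```

A region would then be accepted as rectilinear when the fraction exceeds some threshold and the ratio is consistent with a plausible house outline; both thresholds would have to be set by experiment.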

A shadow-of-a-house is a region within a house-shadow
locale satisfying constraints on gray value, contrast,
and area (depending on sun inclination).

A  house-shadow  locale  is  a  subset  of  the  image 
determined  by  the  3D  sun  direction,  the  border  of  a 
house  instance,  and  a  hypothesis  of  maximal  house 
height.  (We  only  estimate  this  locale  roughly  since  we 
don’t  know  the  sun  direction  accurately.) 

A  house-adjunct  has  a  definition  like  that  of  a 
house.  Shadow-of-a-house-adjunct  and  its  locale  are 
similar  to  the  corresponding  concepts  for  houses. 

A  road  complex  has  suitable  gray  level,  area,  and 
non-compactness.  (These  weak  definitions  have  been 
found  to  work  well  enough  for  our  current  purposes.) 

A  tree  has  constraints  on  area  and  gray-level. 
(Woods  are  another  matter,  requiring  a  more  compli¬ 
cated  definition.) 

A  vehicle  is  a  small,  compact,  bright  region,  located 
on or near a driveway or road. A vehicle-on-a-road
locale  is  a  subset  near  a  road.  A  vehicle-on-a-driveway 
locale  is  near  a  house  and  a  driveway. 


3.3.  Experimental  Results 

The  current  version  of  the  system  has  been  tested 
on  several  images.  Figures  6-11  show  some  results 
obtained  for  small  parts  of  two  suburban  neighborhoods. 

Figures  6a  and  6b  show  original  and  SNF-enhanced 
images  of  the  first  neighborhood,  while  Figures  6c  and  6d 
show  the  instances  of  roads,  houses,  house-adjuncts,  and 
their  shadows  found  first  by  search  among  the  high- 
contrast  components,  then  accumulated  during  subse¬ 
quent  search  among  low-contrast  components.  Black 
borders  indicate  the  individual  components  of  the  level 
of  segmentation,  which  constitute  a  basis  for  search  by 
merging;  white  borders  indicate  actually  found  instances 
of  different  types.  Figures  7a-d  and  8  show  instances 
found  by  type,  with  their  outlines  overlaid  on  the  origi¬ 
nal  image,  to  give  some  feeling  for  difficulty  and  accu- 


Figure 8. Road complex.

Figure 10. (a) Low-contrast search, (b) Houses,
(c) Road complex, (d) House shadows.

Because our Prolog image theory is rudimentary,
parts of the image haven’t been fully interpreted. For
example, we know that the house-adjuncts are not
garages  since  they  are  not  approached  by  driveways. 
But  we  need  a  theory  of  shading  to  properly  interpret 
the  whole  structures  of  these  houses.  (Note  that  the  sun 
direction  is  to  the  right,  also  that  the  shadows  of  the 
adjuncts  are  small  compared  to  those  of  simple  houses.) 
Shading  and  shadows  are  very  complex,  being  deter¬ 
mined  by  sun  direction  and  the  geometry  of  shaded  and 
overshadowing objects. This points to the need for general "theories" of shading and overshadowing, and of occlusion, even in 2D image interpretation. This is especially important in reasoning about missing or conflicting visual evidence, for diagnosis of errors and "virtual" delineation of objects whose evidence is weak.

Figures 9-11 show four houses, their shadows, trees, and the road; apparently there are two vehicles, one on a driveway, the other near the street, the first much larger. Note that the enhanced image preserves the roof structure very well. Note also that the delineation of the shadows of the houses gives useful information about the height and structure of the houses: one house has two
stories,  while  another  is  split-level,  with  the  short  side 
toward  the  vehicle,  perhaps  indicating  an  attached 
garage. 

There  is  a  considerable  amount  of  visual  informa¬ 
tion  even  in  poor  quality  aerial  images  such  as  these  (6 
bits,  40  gray  levels,  with  poor  contrast  and  blur).  We 
are  able  to  detect  and  delineate  objects  which  have  a 
figure-ground  contrast  of  only  one  or  two  gray  levels, 
and  about  which  we  have  to  reason  in  order  to  identify 
them,  however  uncertainly. 

3.4.  Conclusion 

The first version of the PPL system has been implemented fully and the results so far have been very encouraging. The performance of the system can be
improved  by  strengthening  our  image  theory  by  adding 
new  types  of  objects  and  relations;  this  will  require  more 
powerful  and  computationally  expensive  image  analysis. 
Figure  11.  (a)  Trees,  (b)  Two  vehicles  (near  road  and  on  driveway).


We  also  intend  to  pursue  five  areas  of  research:  (i) 
Development  of  a  parallel  logic-programming  system  for 
implementation  of  PPL  and  other  complex  programs  for 
image  interpretation,  probably  using  our  128  node 
Butterfly  multi-processor,  (ii)  Investigation  of  region- 
based  and  feature-based  stereo  analysis  and  its  relation 
to  image  interpretation,  (iii)  Introduction  of  image 
theories  for  shaded,  overshadowed,  and  occluded  objects, 
(iv)  Implementation  of  object-specific  image  resegmenta¬ 
tion,  especially  to  detect  and  delineate  small  objects  and 
thin  structures,  (v)  Research  on  more  powerful,  heuristic 
search  procedures  based  on  edge  and  region  compatibil¬ 
ity  relations  among  components. 
ABSTRACT


Developed  herein  is  a  formal  theory  for  stereo  vision  which  uni¬ 
fies  existing  stereo  methods  and  predicts  a  large  variety  of  stereo 
methods not yet explored. The notion of "stereo" is defined axiomatically using terms which are both general and precise, giving stereo vision a broader and more rigorous foundation. The
variations  in  imaging  geometry  between  successive  images  used 
in  parallax  stereo  and  conventional  photometric  stereo  techniques 
are  extended  to  stereo  techniques  which  involve  variations  of  arbi¬ 
trary  sets  of  physical  imaging  parameters.  Physical  measurement 
of  visual  object  features  is  defined  in  terms  of  solution  loci  in  fea¬ 
ture  space  arising  from  constraint  equations  that  model  the  phys¬ 
ical  laws  that  relate  the  object  feature  to  specific  image  features. 
Ambiguity  in  physical  measurement  results  from  a  solution  locus 
which  is  a  subset  of  feature  space  larger  than  a  single  measurement 
point. Stereo methods attempt to optimally reduce ambiguity of physical measurement by intersecting solution loci obtained from successive images. A number of examples of generalized stereo techniques are presented. This new conception of stereo vision offers a new perspective on many areas of computer vision including areas that have not been previously associated with stereo vision (e.g. color imagery).

As the central focus of generalized stereo vision methods is on measurement ambiguity, mathematical developments are presented that characterize the "size" of measurement ambiguity as well as the conditions under which disambiguation of a solution locus takes place. The dimension of measurement ambiguity at a solution point is defined using the structure of a differentiable
manifold  and  an  upper  bound  is  established  using  the  implicit 
function  theorem.  The  symmetry  group  of  bijective  transforma¬ 
tions  of  feature  space  into  itself,  which  leave  a  solution  locus  in¬ 
variant,  are  used  to  analyze  the  ambiguity  of  a  solution  locus.  A 
purely  group  theoretic  characterization  of  the  conditions  under 
which  measurement  disambiguation  takes  place  is  given. 

1  INTRODUCTION 

It  is  felt  that  the  existing  set  of  vision  techniques  which  are  as¬ 
sociated  with  the  notion  of  “stereo”  reflect  a  conceptual  basis 
for  stereo  vision  which  is  much  too  limited.  For  the  most  part
stereo  vision  implies  the  traditional  techniques  of  parallax  stereo 
which measures the depth, or more generally, the 3D coordinate position of a point on an object surface, from the intersection of distinct rays of projection onto image planes. Clever variations on parallax stereo such as with intensity gradient light striping in [Schwartz 1983] and with one-eyed stereo in [Strat and Fischler 1985] ultimately must rely on conceptually equivalent triangulation methods to measure the position of points.

A  set  of  techniques  to  measure  local  surface  orientation  us¬ 
ing  a  Lambertian  reflectance  map  for  material  surfaces  was  in¬ 
troduced  in  [Woodham  1978]  and  named  photometric  stereo.  Al¬ 
though  completely  distinct  from  the  techniques  of  parallax  stereo, 
the  techniques  of  photometric  stereo  also  rely  upon  a  sequence  of 
successive  images  to  compute  local  surface  orientation.  Unlike 
parallax  stereo,  the  techniques  of  photometric  stereo  keep  the  im¬ 
age  plane  static  and  vary  the  incident  orientation  of  a  single  light 
source  between  successive  images.  Also,  in  photometric  stereo  im¬ 
age  irradiance  at  a  corresponding  pixel  is  extracted  rather  than 
2D  image  coordinate  position  as  in  parallax  stereo. 

Another  set  of  stereo  techniques  to  measure  local  surface  ori¬ 
entation  using  a  more  physically  accurate  reflectance  map  was 
introduced  in  [Wolff  1987a].  The  single  incident  light  source  and
the  image  plane  are  held  static  between  successive  images.  Only 
variations  in  the  incident  wavelength  and/or  polarization  of  the 
light  source  are  made  between  successive  images.  As  in  photo¬ 
metric  stereo  the  image  irradiance  at  the  same  specified  pixel  in 
each  successive  image  produces  a  distinct  equireflectance  curve  in 
gradient  space.  The  intersection  point(s)  for  all  equireflectance
curves  is  the  measurement  of  local  surface  orientation. 

The  existence  of  photometric  stereo  as  well  as  spectral  and 
polarization  stereo  techniques  which  fall  in  the  same  category  of 
techniques as parallax stereo under the intuitive idea of "stereo" suggests a much broader foundation for stereo vision. But then how can a stereo vision technique be formally defined or characterized? Is there a cohesive framework within which to describe all existing stereo techniques?

Informally,  a  stereo  vision  technique  performs  a  quantitative 
physical  measurement  of  some  kind  of  visual  object  feature  using
a  sequence  of  successive  images  such  that  between  each  succes¬ 
sive  image  at  least  one  parameter  of  the  imaging  system  is  varied. 
The  visual  object  feature  is  not  directly  observed;  rather,  only
image  functions  are  empirically  ascertained  and  then  certain  well 
defined  operations  are  performed  on  each  image  function.  These 
operations  on  image  functions  are  simple  for  the  cases  of  parallax 
and  photometric  stereo  and  spectral  and  polarization  stereo.  As¬ 
suming  that  correspondence  has  already  been  solved  for  a  given 
parallax  stereo  technique  measuring  the  3D  position  of  an  object 
point, the image operation performed extracts from each successive image the 2D image coordinate position of the corresponding pixel. For photometric stereo and spectral and polarization
stereo  the  image  irradiance  is  extracted  at  the  corresponding  pixel. 
Other  stereo  techniques  have  been  developed  which  require  more 
elaborate  image  operations. 

After  extracting  appropriate  numerical  values  using  an  image 
operation,  the  physical  measurement  of  a  visual  object  feature  re¬ 
quires  the  solution  of  a  system  of  one  or  more  equations  which 
relate  these  extracted  numerical  values  to  the  visual  object  fea¬ 
ture.  This  system  of  equations  which  arises  for  each  successive 
image  can  be  interpreted  as  the  camera  model  with  respect  to 
the given object feature. In the case of traditional parallax stereo the system of equations arising from each image consists of the two perspective projection equations relating the corresponding 2D image coordinate position to the 3D world coordinate position of the visual feature. For photometric stereo and spectral and polarization stereo
techniques  a  single  image  irradiance  equation  arises  for  each  suc¬ 
cessive  image  relating  the  image  irradiance  at  the  corresponding 
pixel  to  the  reflected  scene  radiance  at  the  point  on  the  object  at 
which  local  surface  orientation  is  to  be  measured. 

For a given stereo vision technique the solution(s) of the system of equations for each successive image can be interpreted as a
solution  locus  of  points  in  a  feature  vector  space  which  is  spanned 
by  the  physically  independent  parameters  used  to  represent  the 
visual  object  feature  to  be  measured.  A  set  of  object  or  imaging 
parameters  are  said  to  be  physically  independent  if  the  physical 
variation  of  the  value  for  any  one  parameter  does  not  affect  the
values  for  any  of  the  other  parameters.  Valid  measurement  points 
exist  only  at  the  mutual  intersection  of  all  solution  loci  obtained 
from  all  successive  images.  That  is,  a  point  on  a  given  solution 
locus  is  a  measurement  point  with  respect  to  the  sequence  of  suc¬ 
cessive  images  taken  for  a  stereo  vision  technique  if  and  only  if 
this  point  is  "corroborated"  as  being  on  the  solution  loci  for  all
successive  images  with  respect  to  variations  that  are  performed  on 
a  preselected  set  of  imaging  parameters.  The  essence  of  physical 
measurement  by  stereo  vision  techniques  is  to  isolate  valid  mea¬ 
surement  points  from  the  disambiguation  of  intersecting  solution 
loci. 
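
The idea of isolating measurement points by intersecting solution loci can be illustrated with a minimal sketch. The integer grid, the two residual functions and the helper name `solution_locus` are hypothetical choices made only for this illustration; they are not camera models taken from the paper.

```python
from itertools import product


def solution_locus(constraint, grid):
    """Return the set of grid points at which the constraint residual is zero."""
    return {x for x in grid if constraint(x) == 0}


# Hypothetical 2-D feature space, sampled on an integer grid.
grid = set(product(range(-5, 6), repeat=2))

# Hypothetical camera-model residuals, one per successive image.
locus_1 = solution_locus(lambda x: x[0] + x[1] - 2, grid)  # first image
locus_2 = solution_locus(lambda x: x[0] - x[1], grid)      # second image

# Valid measurement points lie on every solution locus.
measurement_points = locus_1 & locus_2
print(sorted(measurement_points))
```

Each locus alone contains many points, so each single image gives an ambiguous measurement; only the mutual intersection is small.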

The  resulting  solution  loci  from  each  successive  image  for  tra¬ 
ditional  parallax  stereo  are  lines  in  a  3-dimensional  Euclidean  fea¬ 
ture  space  of  world  coordinate  position  points.  Photometric  stereo 
and  spectral  and  polarization  stereo  techniques  generate  solution 
loci  which  are  one  dimensional  equireflectance  curves  in  a  feature 
space  which,  for  example,  can  be  2-dimensional  gradient  space 
representation  for  surface  orientation.  From  this  more  general 
perspective,  it  is  theoretically  possible  for  there  to  exist  stereo  vi¬ 
sion  techniques  which  isolate  valid  measurement  points  from  the 
intersection  of  multi-dimensional  solution  loci  in  a  higher  dimen¬ 
sional  feature  space.  In  fact  a  stereo  technique  in  [Wolff  1987b], 
and  summarized  in  this  paper,  to  measure  surface  curvature  in¬ 
tersects  2  dimensional  solution  loci  in  a  5  dimensional  feature 
space.  The  mathematics  of  disambiguation  in  this  case  can  be 
rather  abstract  and  complicated,  and  establishing  useful  tools  to 
quantify  the  size  of  measurement  ambiguity  as  well  as  the  con¬ 
ditions  under  which  disambiguation  takes  place  occupies  much  of 
the  development  in  this  paper. 

In  summary,  four  distinct  quantities  are  seen  to  characterize  a 
stereo  vision  technique  at  the  level  of  unifying  all  existing  stereo 
techniques.  The  first  is  the  visual  object  feature  to  be  measured. 
The  second  is  the  image  operation  performed  on  each  successive 
image  extracting  appropriate  numerical  values.  The  third  is  a 
preselected set of imaging parameters of which at least one parameter is varied between each successive image. The fourth is the system of camera modeling equations which relate the first
three  quantities  together.  These  four  quantities  will  be  formally 
defined  in  section  2. 

For  a  given  generalized  stereo  vision  technique,  the  physical 
measurement  produced  from  the  mutual  intersection  of  solution 
loci  in  an  arbitrary  multi-dimensional  feature  space  is  said  to  be 
ambiguous  if  the  intersection  consists  of  more  than  a  single  point. 
The  term  ambiguity  does  not  in  any  way  imply  experimental  er¬ 
ror.  Rather  the  ambiguity  of  the  quantitative  measurement  of 
visual  object  features  arises  from  the  nonunique  solution  to  the 
equations  that  relates  the  value  of  the  object  feature  to  functions 
of  the  corresponding  image  pixel  values.  A  solution  locus  which 
consists  of  more  than  a  single  point  in  feature  space  is  consid¬ 
ered  to  be  an  ambiguous  physical  measurement  since  there  is  an 
uncertainty  in  the  value  of  the  object  feature.  This  ambiguity 
of  physical  measurement  exists  independent  of  experimental  error 
even  though  nonzero  experimental  error  clearly  exacerbates  the 
uncertainty  of  the  state  of  an  object  feature  with  a  further  kind 
of  ambiguity. 

Certain  measures  of  measurement  ambiguity  are  proposed  in 
this  paper  assuming  that  connected  components  of  the  intersec¬ 
tion  of  solution  loci  form  manifolds.  A  global  measure  of  measure¬ 
ment  ambiguity  simply  integrates  volume  elements  of  appropriate 
dimensionality  over  the  entire  mutual  intersection  of  the  solution 
loci.  A  local  measure  of  measurement  ambiguity  utilizes  the  im¬ 
plicit  function  theorem  applied  to  the  camera  model  equations  for 
the  fourth  axiom  of  generalized  stereo  to  compute  an  upper  bound 
on  the  dimension  of  measurement  ambiguity  at  a  point.  This  char¬ 
acterizes  the  dimension  of  a  local  neighborhood  of  a  point  on  the 
mutual  intersection  of  solution  loci. 
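
One standard way to read this local bound, offered here as an illustrative sketch rather than as the paper's own statement, is the following. The symbol $\Phi$ for the stacked camera model equations from all images, the point $x_0$ and the rank $r$ are notation assumed for this sketch.

```latex
% Stack the h equations from each of the (l + 1) images into one map.
\Phi : \mathbb{R}^n \to \mathbb{R}^{h(l+1)}, \qquad \Phi(x_0) = 0, \qquad
r = \operatorname{rank} \left. \frac{\partial \Phi}{\partial x} \right|_{x_0}

% Choose r component equations whose gradients are linearly independent at x_0.
% By the implicit function theorem their common zero set is, near x_0,
% a manifold of dimension n - r. The mutual intersection of all solution loci
% is contained in that set, so its local dimension at x_0 satisfies
\dim_{x_0} \Big( \bigcap_{i} M_i \Big) \;\le\; n - r .
```

For instance, with a 2-dimensional feature space and a single equation whose gradient does not vanish at the solution point, the bound gives a locus of dimension at most one, that is, a curve.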

Suppose for a given generalized stereo technique that the first image produces the solution locus $M_1$ in feature space. Let the solution locus resulting from the second image be $M_2$. The solution locus $M_2$ is said to disambiguate solution locus $M_1$ if and only if $M_1 \cap M_2 \subsetneq M_1$. This paper presents a theoretical characterization of the conditions under which the solution locus $M_1$ is disambiguated by another successive image. This is done using
the  notion  of  the  symmetry  of  a  solution  locus,  precisely  expressed 
in  the  language  of  group  theory. 

A  symmetry  for  a  solution  locus  in  feature  space  is  a  one-to- 
one,  onto  transformation  of  feature  space  coordinates,  called  an 
automorphism,  that  preserves  the  set  of  points  occupied  by  the 
solution  locus  when  it  is  transformed.  For  photometric  stereo  and 
spectral  and  polarization  stereo  the  equireflectance  curves  in  gra¬ 
dient  space  possess  a  flip  symmetry  about  a  specified  line.  The 
feature  space  coordinate  transformation  that  represents  this  sym¬ 
metry  may  move  solution  locus  points  around,  but  the  end  result 
is  that  the  transformed  solution  locus  looks  exactly  the  same  as 
the  original  solution  locus.  A  useful  property  of  symmetries  for  a 
solution  locus  is  that,  with  respect  to  composition  of  transforma¬ 
tions,  they  form  a  group  which  in  turn  is  a  subgroup  of  the  set  of 
all  transformations  of  feature  space  into  itself. 
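
The flip symmetry can be checked numerically for the Lambertian case. The sketch below assumes the standard Lambertian image irradiance equation in gradient space, and it assumes that the line of symmetry is the one through the origin and the light source point $(a_1, a_2)$; that identification, the function names and the sample values are assumptions of this illustration.

```python
import math


def lambertian_residual(x, a, albedo, irradiance):
    """Residual of the Lambertian irradiance equation at orientation x, light a."""
    x1, x2 = x
    a1, a2 = a
    num = albedo * (1.0 + x1 * a1 + x2 * a2)
    den = math.sqrt(1.0 + x1 * x1 + x2 * x2) * math.sqrt(1.0 + a1 * a1 + a2 * a2)
    return num / den - irradiance


def flip_about_light_axis(x, a):
    """Reflect x about the line through the origin with direction a (a != 0)."""
    x1, x2 = x
    a1, a2 = a
    scale = 2.0 * (x1 * a1 + x2 * a2) / (a1 * a1 + a2 * a2)
    return (scale * a1 - x1, scale * a2 - x2)


light = (0.4, -0.3)       # hypothetical light orientation in gradient space
point = (0.7, 0.2)        # hypothetical surface orientation
flipped = flip_about_light_axis(point, light)

# The reflection preserves both x . a and |x|, so the residual is unchanged:
# every point of an equireflectance curve is mapped to another point of it.
r_original = lambertian_residual(point, light, 0.8, 0.5)
r_flipped = lambertian_residual(flipped, light, 0.8, 0.5)
print(abs(r_original - r_flipped) < 1e-12)
```

The transformation moves individual points of the curve, yet the curve as a point set is unchanged, which is the property the text describes.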

The  variation  of  imaging  parameters  between  successive  im¬ 
ages  induces  a  coordinate  transformation  on  feature  space,  which 
will be called a stereo automorphism. If the first image produces solution locus $M_1$ and the second image produces solution locus $M_2$, then the stereo coordinate automorphism induced by the variation of imaging parameters between the first and second images must transform the point set $M_1$ into the point set $M_2$. Clearly if the corresponding stereo automorphism coincides with a symmetry automorphism of $M_1$, then $M_1 = M_2$ and no disambiguation takes place. It is shown that in fact disambiguation cannot take place for a much larger set of stereo automorphisms. This paper establishes a theorem that states that disambiguation of $M_1$ can take place if and only if the stereo automorphism induced by the variation of imaging parameters between successive images is outside the normalizer subgroup of the symmetry subgroup of $M_1$, in the group of all automorphisms of feature space into itself.

2  GENERALIZED  STEREO 

Before  presenting  an  axiomatic  definition  of  stereo  vision  in  terms 
of  a  formalization  of  the  four  quantities  visual  object  feature,  image 
operation,  imaging  parameters  and  camera  modeling  equations, 
some  development  needs  to  be  given  towards  the  formalization  of 
an  image  operation.  Then,  after  formally  defining  stereo  vision  in 
a  general  way,  a  number  of  stereo  vision  techniques  will  be  dis¬ 
cussed  in  context  with  this  definition.  Many  of  these  stereo  vision 
techniques  are  new  and  have  only  been  presented  in  publications 
dealing  with  different  phases  of  this  research. 

In general an s color image can be represented by an s-tuple image function, $I_s$, which is considered to be a function from $\mathbb{R}^2$ into the set of real s-tuples, $(c_1, c_2, \ldots, c_s)$, for $s \geq 1$. An image operation upon an s-tuple image function can produce a t-tuple image function for any $t \geq 1$. So for instance a gradient image operation upon a differentiable 1-tuple image function produces a 2-tuple image function of $\mathbb{R}^2$ mapped into real gradient vectors in 2 dimensions. An image operation therefore defines a functional from the set of s-tuple image functions into the set of t-tuple image functions. For such an image operation functional, $F_t$, acting on the s-tuple image function, $I_s$, the evaluation of the real t-tuple at the point (i.e. pixel) $p \in \mathbb{R}^2$ for the corresponding t-tuple image function is denoted by $F_t(I_s)|_p$. This will be called a point valued image operation functional.
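
A point valued image operation functional can be sketched as a higher-order function: it takes an image function and returns a new image function, which is then evaluated at a pixel. The central-difference approximation of the gradient, the names `gradient_operation` and `evaluate_at`, and the sample image are assumptions made for this illustration.

```python
def gradient_operation(image, step=1e-5):
    """Map a 1-tuple image function on R^2 to a 2-tuple (gradient) image function."""
    def gradient_image(p):
        x, y = p
        # Central differences approximate the two partial derivatives.
        dx = (image((x + step, y)) - image((x - step, y))) / (2.0 * step)
        dy = (image((x, y + step)) - image((x, y - step))) / (2.0 * step)
        return (dx, dy)
    return gradient_image


def evaluate_at(functional, image, p):
    """Point valued evaluation: apply the functional, then evaluate at pixel p."""
    return functional(image)(p)


def sample_image(p):
    """Hypothetical differentiable 1-tuple image function."""
    return p[0] ** 2 + 3.0 * p[1]


gx, gy = evaluate_at(gradient_operation, sample_image, (1.0, 2.0))
print(round(gx, 6), round(gy, 6))
```

Here `evaluate_at(F, I, p)` plays the role of the notation $F_t(I_s)|_p$ with $s = 1$ and $t = 2$.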

A more general image operation functional which can evaluate real t-tuples over arbitrary point regions can be defined using the power set, $P(\mathbb{R}^2)$, of $\mathbb{R}^2$. This type of image operation functional maps s-tuple image functions into functions from $P(\mathbb{R}^2)$ into real t-tuples. An example of such a functional assigns an s-tuple image function, $I_s$, to a function from $P(\mathbb{R}^2)$ to the 1-tuple area value of each subset of pixel points. Another example of such a functional assigns an s-tuple image function, $I_s$, to a function from $P(\mathbb{R}^2)$ to the s-tuple which represents the average over each subset of pixel points. Analogous to the definition of notation above, if $F_t$ acts on the s-tuple image function $I_s$, the evaluation of the real t-tuple at the region $A \subset \mathbb{R}^2$ for the corresponding t-tuple image function is denoted by $F_t(I_s)|_A$. This will be called a region valued image operation functional even though it is understood that the set of all regions includes point regions as well.
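
The two region valued examples above, area and average, can be sketched on a discrete pixel grid. Representing a region as a finite set of pixel coordinates and measuring area by pixel count are simplifying assumptions of this illustration, as are the function names and the sample image.

```python
def area_operation(image):
    """Region valued 1-tuple functional: the result ignores the image values."""
    def area(region):
        # Area is taken to be the number of pixels in the region.
        return (float(len(region)),)
    return area


def average_operation(image):
    """Region valued s-tuple functional: componentwise mean over a non-empty region."""
    def average(region):
        values = [image(p) for p in region]
        s = len(values[0])
        return tuple(sum(v[i] for v in values) / len(values) for i in range(s))
    return average


def rgb_image(p):
    """Hypothetical 3-tuple image function."""
    return (float(p[0]), float(p[1]), 1.0)


region = frozenset({(0, 0), (2, 0), (0, 2), (2, 2)})
region_area = area_operation(rgb_image)(region)
region_mean = average_operation(rgb_image)(region)
print(region_area, region_mean)
```

A single-pixel region reduces the average to the point value of the image, which matches the remark that point regions are included among all regions.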

The  formal  axiomatic  definition  of  stereo  vision  states  that  a
stereo  vision  technique  is  completely  defined  by  the  following  four 
quantities: 


(The term functional implies a function from one set of functions to another set of functions.)

-ir'-*  vain*’  am  )  ••  *l*-fm*'*t  as  t>*  inu  »h--  l«.»  -ii m*  noon  ■ 
•  Visual object feature to be measured represented as a real n-tuple of physically independent parameters $(x_1, x_2, \ldots, x_n)$.

•  A well defined image operation functional on the set of s-tuple image functions into the set of t-tuple image functions, $F_t(I_s)|_p$ for point valued, $F_t(I_s)|_A$ for region valued, $s, t \geq 1$.

•  A preselected set of physically independent image system parameters, $(a_1, a_2, \ldots, a_m)$, at least one of which is varied between each successive image.

•  Camera modeling equations involving the quantities defined in the first three axioms:

$$\Phi_1(x_1, x_2, \ldots, x_n, a_1^{(k)}, a_2^{(k)}, \ldots, a_m^{(k)}, F_t^{(k)}(I_s)|_A) = 0$$

$$\Phi_2(x_1, x_2, \ldots, x_n, a_1^{(k)}, a_2^{(k)}, \ldots, a_m^{(k)}, F_t^{(k)}(I_s)|_A) = 0$$

$$\vdots \qquad (1)$$

$$\Phi_h(x_1, x_2, \ldots, x_n, a_1^{(k)}, a_2^{(k)}, \ldots, a_m^{(k)}, F_t^{(k)}(I_s)|_A) = 0$$

the superscript (k) implying a dependence on each successive image. A point valued image operation functional could just as well have been used.


To distinguish from stereo vision techniques which are related exclusively to parallax stereo techniques, the complete set of stereo techniques which are defined by the four axioms above will be called generalized stereo techniques. The real valued n-tuples and m-tuples from the first and third axioms above will also be termed object feature and imaging system states respectively.

Even  more  generally,  a  different  image  operation  functional 
can  be  used  for  each  successive  image,  but  particular  generalized 
stereo  methods  of  this  kind  will  not  be  presented  here.  Examples 
of  generalized  stereo  techniques  presented  below  have  a  constant 
number of equations, $h \geq 1$, produced from each successive image. This definition of generalized stereo excludes any auxiliary
constraints  provided  by  scene  modeling,  scene  conditioning  (e.g. 
light  striping)  or  any  a  priori  assumptions. 

For (l + 1) successive images there will be h(l + 1) equations in all. As implied by equations (1) the physical state of the object feature to be measured is assumed to remain invariant while the physical state of the imaging system is assumed to be altered between each successive image. Let $M_0$ be the solution locus obtained from the first image and let $M_i$ be the solution locus obtained from the (i + 1)st successive image for $1 \leq i \leq l$. Then the solution locus $M_0$ is said to be disambiguated with respect to the stereo sequence of (l + 1) successive images if and only if

$$\bigcap_{i=0}^{l} M_i \subsetneq M_0.$$


The "size" of the resulting solution locus and the conditions under which a solution locus can be disambiguated are topics discussed in sections 3 and 4.
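
The definition of disambiguation over a stereo sequence can be sketched with finite point sets. Modeling solution loci as finite Python sets is an assumption of this illustration (in the paper they are subsets of a continuous feature space), and the function names and sample loci are hypothetical.

```python
from functools import reduce


def mutual_intersection(loci):
    """Intersect the solution loci M_0, ..., M_l, each given as a set of points."""
    return reduce(lambda a, b: a & b, loci)


def is_disambiguated(loci):
    """True iff the mutual intersection is a proper subset of M_0 = loci[0]."""
    return mutual_intersection(loci) < loci[0]


def is_ambiguous(loci):
    """True iff more than a single candidate measurement point survives."""
    return len(mutual_intersection(loci)) > 1


m0 = {(0, 0), (1, 1), (2, 2)}   # locus from the first image
m1 = {(1, 1), (2, 2), (3, 3)}   # locus from the second image
m2 = {(2, 2), (5, 0)}           # locus from the third image

print(is_disambiguated([m0, m1]), is_ambiguous([m0, m1]))
print(mutual_intersection([m0, m1, m2]))
```

The second image disambiguates the first locus without removing the ambiguity entirely; the third image reduces the intersection to a single measurement point.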

The definition of generalized stereo not only serves as a precise unifying framework for existing "stereo" techniques but predicts a vast variety of techniques which have not yet been explored. Some of these new generalized stereo techniques may be of significant practical and even biological importance. Presented now are a number of stereo techniques, some of which are familiar and some of which are new.
EXAMPLE  1  (PARALLAX  STEREO)

This  is  the  original  stereo  vision  algorithm  that  has  been  in  ex¬ 
istence  for  thousands  of  years  and  is  the  algorithm  that  is  usually 
implied  when  the  term  "stereo"  is  mentioned  [Ballard  and  Brown
1982].  As  discussed  in  the  introduction,  the  visual  object  feature 
to  be  measured  is  the  depth,  or  more  generally,  the  three  dimen¬ 
sional  world  spatial  position  of  a  point  on  an  object  demarcated 
by  some  prominence  like  an  edge  marking.  A  pixel  valued  2-tuple 
image  operation  functional  is  used  which  maps  any  image  function 
into  a  function  which  maps  each  pixel  in  $\mathbb{R}^2$  (the  image  array  of
pixels)  into  the  2-tuple  representing  its  two  dimensional  location 
using  a  given  image  coordinate  system.  The  2-tuple  valued  im¬ 
age  operation  functional  is  evaluated  at  the  pixel  corresponding 
to the manifestation of the object point in the image and, assuming the projection mapping depicted in figure 1, two perspective equations arise for each successive image. The preselected physical parameters to be varied between successive images are the three parameters specifying the world spatial position of the focal point which are represented by $(a_1^{(k)}, a_2^{(k)}, a_3^{(k)})$ for the kth successive image, being consistent with the notation in equations (1). If the image coordinates of the object point in the kth successive image are $(u^{(k)}, v^{(k)})$, the focal length is always $f$ and assuming no vergence, then the resulting equations are


which  not  surprisingly  is  the  intersection  of  two  planes  in  the  three 
dimensional  feature  space  of  world  position  coordinates.  In  fact 
two  successive  images  overconstrain  the  point  solution  with  the
intersection  of  four  planes. 
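
The intersection of four planes can be sketched numerically. The projection convention below is an assumption of this illustration and may differ in sign or axis order from the paper's figure 1: the optical axis is taken along the third world axis, and the perspective equations are written as $f(x_1 - a_1) - u(x_3 - a_3) = 0$ and $f(x_2 - a_2) - v(x_3 - a_3) = 0$, each of which is a plane in the feature space of world positions. All function names and sample values are hypothetical.

```python
def plane_rows(focal_point, image_point, f):
    """Two planes (normal, offset) contributed by one image under the assumed model."""
    a1, a2, a3 = focal_point
    u, v = image_point
    return [([f, 0.0, -u], f * a1 - u * a3),
            ([0.0, f, -v], f * a2 - v * a3)]


def project(x, focal_point, f):
    """Synthetic perspective projection used to generate test observations."""
    a1, a2, a3 = focal_point
    depth = x[2] - a3
    return (f * (x[0] - a1) / depth, f * (x[1] - a2) / depth)


def solve3(m, r):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [rhs] for row, rhs in zip(m, r)]
    for c in range(3):
        pivot = max(range(c, 3), key=lambda i: abs(aug[i][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for i in range(c + 1, 3):
            factor = aug[i][c] / aug[c][c]
            for j in range(c, 4):
                aug[i][j] -= factor * aug[c][j]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (aug[i][3] - s) / aug[i][i]
    return x


def triangulate(observations, f):
    """Least-squares intersection of all planes via the normal equations."""
    rows = [row for fp, ip in observations for row in plane_rows(fp, ip, f)]
    ata = [[sum(n[i] * n[j] for n, _ in rows) for j in range(3)] for i in range(3)]
    atb = [sum(n[i] * d for n, d in rows) for i in range(3)]
    return solve3(ata, atb)


f = 1.0
world_point = (1.0, 2.0, 10.0)
focal_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observations = [(fp, project(world_point, fp, f)) for fp in focal_points]
estimate = triangulate(observations, f)
print([round(c, 6) for c in estimate])
```

With a single image the two planes meet in a line, the normal equations become singular, and this sketch would fail with a division by zero, which reflects the ambiguity of a single-image measurement.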

Auxiliary  scene  conditioning  such  as  light  striping  [Schwartz 
1983]  adds  an  auxiliary  planar  solution  locus  in  feature  space  to 
the  intersection  of  two  planes  produced  from  a  single  image  to  get 
a  unique  solution  point.  Another  monocular  stereo  technique  is 
employed  by  [Strat  and  Fischler  1985]  who  generate  a  second  hy¬ 
pothesized  virtual  image  from  an  appropriate  scene  model.  This 
also  adds  an  auxiliary  solution  locus  in  the  world  feature  space  to 
obtain  a  unique  solution  point  from  a  single  image.  These  meth¬ 
ods  are  not  considered  here  to  be  pure  stereo  methods  because 
they  employ  auxiliary  assumptions  or  constraints. 

EXAMPLE  2  (PHOTOMETRIC  STEREO) 

This is a stereo technique first proposed in [Woodham 1978] to determine the shapes of material surfaces assuming a Lambertian reflectance map. Related stereo techniques are used in [Silver 1980] and [Ikeuchi 1987]. The visual object feature to be measured is local surface orientation and the image operation functional used is simply the identity mapping from 1-tuple image irradiance functions on $\mathbb{R}^2$ to the same 1-tuple image irradiance functions on $\mathbb{R}^2$. That is, all that needs to be evaluated is the image irradiance value $I^{(k)}$ at the corresponding pixel in the kth successive image. The physical state of the imaging system to be altered is the incident light orientation represented by the 2-tuple $(a_1^{(k)}, a_2^{(k)})$ denoting gradient space coordinates $(p, q)$ respectively. The physical state of the surface orientation $(x_1, x_2)$ will be solved for in a two dimensional feature space which in gradient space coordinates yields the following equation, assuming a Lambertian reflectance map, for the kth successive image

$$\frac{\rho\,(1 + x_1 a_1^{(k)} + x_2 a_2^{(k)})}{\sqrt{1 + x_1^2 + x_2^2}\,\sqrt{1 + (a_1^{(k)})^2 + (a_2^{(k)})^2}} - I^{(k)} = 0 \qquad (2)$$

with albedo $\rho$. The resulting solution locus from each successive image is commonly referred to as an equireflectance curve.
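
A minimal numerical sketch of this technique follows. Rather than tracing equireflectance curves, it uses the classical three-light linear solution, which is swapped in here for brevity: under the Lambertian assumption each irradiance is the inner product of a unit light direction with the albedo-scaled unit surface normal, so three images give a 3x3 linear system. The function names, the light orientations and the test surface are assumptions of this illustration.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def unit_direction(a):
    """Unit vector proportional to (a1, a2, 1) for gradient space point a."""
    a1, a2 = a
    norm = math.sqrt(1.0 + a1 * a1 + a2 * a2)
    return [a1 / norm, a2 / norm, 1.0 / norm]


def lambertian_irradiance(x, a, albedo):
    """Irradiance predicted by the Lambertian equation for orientation x, light a."""
    n, s = unit_direction(x), unit_direction(a)
    return albedo * sum(ni * si for ni, si in zip(n, s))


def photometric_stereo(lights, irradiances):
    """Recover (x1, x2) and the albedo from three lights by Cramer's rule."""
    s = [unit_direction(a) for a in lights]
    d = det3(s)  # must be nonzero: the three light directions are not coplanar
    g = []
    for col in range(3):
        m = [row[:] for row in s]
        for k in range(3):
            m[k][col] = irradiances[k]
        g.append(det3(m) / d)
    albedo = math.sqrt(sum(c * c for c in g))
    return (g[0] / g[2], g[1] / g[2]), albedo


lights = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]   # hypothetical light orientations
true_orientation, true_albedo = (0.3, -0.2), 0.8
irradiances = [lambertian_irradiance(true_orientation, a, true_albedo) for a in lights]
orientation, albedo = photometric_stereo(lights, irradiances)
print([round(c, 6) for c in orientation], round(albedo, 6))
```

The recovered orientation is the common point of the three equireflectance curves; with only two lights and unknown albedo the system is underdetermined and the measurement remains ambiguous.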

EXAMPLE  3  (TEXEL  STEREO) 

Even though stereo techniques for the determination of shape from texture have never been formally specified, it is a small conceptual jump from the works of [Kender 1980], [Ohta et al. 1980] and [Aloimonos and Swain 1985], noting that texel area elements exhibit the same behavior as the reflected intensity from a Lambertian surface, that a texel stereo technique equivalent to that of photometric stereo could be developed by modifying the image operation functional used and the preselected set of physical parameters to be varied. Of course the visual object feature to be measured is local surface orientation. The region valued 1-tuple image operation functional to use was mentioned earlier and assigns image functions to the function that assigns every subset of pixels the area covered by that subset. The physical parameters to vary specify viewer orientation and are represented in gradient space coordinates $(a_1^{(k)}, a_2^{(k)})$ for the kth successive image. Assuming the actual surface area of the texel takes the place of the albedo $\rho$ and the evaluation of the 1-tuple image operation functional at the appropriate region of pixels is $A^{(k)}$, the equation that arises in gradient feature space for the kth successive image is the same as equation (2) with $A^{(k)}$ substituted for $I^{(k)}$.


EXAMPLE  4  (SPECTRAL  AND  POLARIZA¬ 
TION  STEREO) 

The  stereo  techniques  presented  so  far  alter  the  physical  state 
of  the  imaging  system  solely  with  respect  to  some  aspect  of  the 
imaging geometry. First proposed in [Wolff 1986] and also reported in [Wolff 1987a] are stereo techniques that measure local surface orientation of material surfaces, using an accurate reflectance model, by only varying the incident wavelength and/or
linear  polarization  of  a  point  light  source  while  leaving  the  imaging 
geometry  completely  invariant.  The  image  operation  functional 
used  is  the  same  as  for  photometric  stereo,  namely  image  irradi¬ 
ance  values  are  measured  in  successive  images. 

The  reflectance  model  that  is  used  was  proposed  in  [Torrance 
and  Sparrow  1967]  and  is  applicable  to  a  wide  variety  of  isotropi¬ 
cally  rough  homogeneous  dielectric  and  conducting  surfaces.  The 
form of the Torrance-Sparrow reflectance function is

$$\rho \{\, g I_0 R_s + I_0 R_d \,\}$$

where the terms $R_s$ and $R_d$ are the functions for the specular and
diffuse components of reflection respectively, the term $I_0$
represents the incident radiance of the light source, $\rho$ is the albedo and
$g$ is the amplitude proportion of the specular reflectance relative to
the diffuse reflectance function. The detailed forms for the
functions $R_s$ and $R_d$ can be found in [Wolff 1987a]. To be consistent
with the unified notation in equations 1, the Torrance-Sparrow
reflectance function will be represented as


$$\rho \{\, g I_0 [\, a_1 F_\perp(\psi', \eta(a_2)) + (1 - a_1) F_\parallel(\psi', \eta(a_2)) \,]\, r_s(x_1, x_2, p_s, q_s) + I_0 R_d(x_1, x_2, p_s, q_s) \,\} \qquad (3)$$


with $a_1$, varying between 0 and 1 inclusive, being the physical
parameter which specifies incident linear polarization, $a_2$ being the
incident wavelength and $(p_s, q_s, -1)$ being the incident light
orientation in gradient space coordinates. So for instance an RGB
color image could be interpreted as a stereo image with respect
to the superimposition of three successive images taken for $a_2$ set
to a red, green and blue wavelength respectively.

The exact forms for $r_s(x_1, x_2, p_s, q_s)$ and $R_d(x_1, x_2, p_s, q_s)$ are
extraneous to the discussion here. What is important to note is
that the expression for the reflected scene radiance in equation 3 is
nonlinearly dependent upon the physical parameters $a_1$ and $a_2$. The
Fresnel reflection coefficients $F_\perp(\psi', \eta(a_2))$ and $F_\parallel(\psi', \eta(a_2))$ for
perpendicular and parallel linearly polarized light respectively are
dependent on the term $\psi' = \psi'(p_s, q_s)$, which is the angle of
incidence on a planar microfacet and equal to half the phase angle.
For pure spectral and polarization stereo methods this angle
remains invariant as the incident light orientation is unchanged.
The Fresnel reflection coefficients are also dependent on the term
$\eta = n - i\kappa$, which is the complex index of refraction for the
material surface; its components are in turn dependent on the
wavelength parameter $a_2$ according to

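$$n^2 = \frac{\nu \gamma c_0^2}{2} \left( 1 + \left[ 1 + \left( \frac{a_2}{2 \pi c_0 r_e \nu} \right)^2 \right]^{1/2} \right)$$

$$\kappa^2 = \frac{\nu \gamma c_0^2}{2} \left( -1 + \left[ 1 + \left( \frac{a_2}{2 \pi c_0 r_e \nu} \right)^2 \right]^{1/2} \right) \qquad (4)$$

where $c_0$ is the speed of light in vacuo, $r_e$ is the electrical resistivity
of the surface material, and $\nu$ and $\gamma$ are the electrical permittivity
and the magnetic permeability of the surface material respectively.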

If the image irradiance value measured for the kth successive image
is $I^{(k)}$, the following equation arises in gradient space solving for
the orientation feature state $(x_1, x_2)$:

$$\rho \{\, g I_0 [\, a_1^{(k)} F_\perp(\psi', \eta(a_2^{(k)})) + (1 - a_1^{(k)}) F_\parallel(\psi', \eta(a_2^{(k)})) \,]\, r_s(x_1, x_2, p_s, q_s) + I_0 R_d(x_1, x_2, p_s, q_s) \,\} - I^{(k)} = 0 \qquad (5)$$

Figure  1  shows  the  intersection  of  equireflectance  curves  solv¬ 
ing  for  surface  orientation  by  varying  incident  linear  polarization. 

It was noted in [Wolff 1987a] that for surface orientations not lying on
the line in gradient space passing through the origin and having
slope $p_s/q_s$, a two point physical measurement ambiguity will
always result regardless of how many successive images are taken
varying the image system state represented by $(a_1^{(k)}, a_2^{(k)})$
specifying incident linear polarization and wavelength. The reason for
this is discussed further in section 4. One solution
to obtain a unique measurement is to include one of the
parameters $p_s$ or $q_s$ specifying incident light source orientation in the
preselected set of physical parameters to alter the image system
state. Another method, which does not alter the orientation of the
light source, is to instead include a single physical parameter in
the preselected set specifying the solid angular subtending area of
a preset asymmetric shape of an adjustable diaphragm in front of
an extended light source.

Equations 1 imply that other physical parameters independent
of imaging geometry and incident wavelength and linear
polarization can be selected to obtain a measurement of local surface
orientation. The variables $r_e$, $\nu$ and $\gamma$ are dependent on
temperature, external electric and magnetic fields and tensile stress on
the material. It is possible to have stereo techniques measuring
surface orientation by preselecting physical parameters
representing the alteration of the state of the image system with respect
to these degrees of freedom.

Figure 1

In  [Wolff  1987c]  another  polarization  stereo  technique  is  pre¬ 
sented  for  more  robust  practical  circumstances.  Surface  orien¬ 
tation  is  computed  from  knowledge  of  the  incident  state  of  po¬ 
larization  and  the  polarization  image  which  is  derived  from  an 
appropriate  stereo  sequence  of  reflected  radiance  detected  by  the 
sensor  for  different  orientations  of  a  linear  polarizer  placed  in  front 
of  the  sensor. 

EXAMPLE  5  (SURFACE  CURVATURE  FROM 
STEREO) 

With  knowledge  of  the  reflectance  map  of  a  material  surface, 
measurement  of  the  Gaussian  curvature  at  a  point  on  the  surface 
can  be  easily  achieved  from  the  two  previous  examples  of  stereo 
methods  by  first  obtaining  the  normal  map  to  the  surface  over 
a  region  local  to  the  point  and  then  computing  the  change  in 
the surface normals at the point in two linearly independent
directions. In effect the Gaussian curvature is determined from local
contextual  knowledge  with  respect  to  surface  orientation  in  the 
vicinity  of  the  point  on  the  surface.  In  [Wolff  1987b]  a  photomet¬ 
ric  method  is  proposed  to  determine  the  Gaussian  curvature  at  a 
point  on  a  material  surface  using  measurements  of  the  gradient 
of  the  reflected  scene  radiance  between  successive  images.  The 
contextual  knowledge  used  is  in  the  image  itself  to  compute  the 
image  gradient  from  neighboring  pixels.  In  this  way  neighboring 
surface  orientations  about  the  point  on  the  material  surface  are 
not  required  to  be  computed. 

The advantages of determining the Gaussian curvature from
this technique with respect to measurement accuracy are
discussed in [Wolff 1988]. Of interest here is the observation that
this technique is another example of a generalized stereo method.
Parameterizing the object surface with respect to height above
the image plane as $(u, v, f(u,v))$ using image coordinates $(u, v)$,
the object feature to be measured is represented by the 5-tuple

$$\left( \frac{\partial f}{\partial u},\ \frac{\partial f}{\partial v},\ \frac{\partial^2 f}{\partial u^2},\ \frac{\partial^2 f}{\partial v^2},\ \frac{\partial^2 f}{\partial u \, \partial v} \right)$$

which in this example will correspond to the point $(x_1, x_2, x_3, x_4, x_5)$
in 5 dimensional feature space. The last three components of the
5-tuple can be recognized as the three parameters determining
the surface Hessian matrix. However from a more general point
of view the above 5-tuple completely specifies both the first and
second order rates of change of the surface height function $f(u,v)$
which are necessary to derive the Gaussian curvature from the
Gauss-Weingarten equations [DoCarmo 1976].
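
To make the role of the 5-tuple concrete, the short Python sketch below evaluates the standard closed-form Gaussian curvature of a height-field surface $(u, v, f(u,v))$ directly from the five feature components. The function name and the sample values are illustrative assumptions; the formula itself is the usual textbook expression for a surface given as a graph.

```python
def gaussian_curvature(x1, x2, x3, x4, x5):
    """Gaussian curvature of the surface (u, v, f(u, v)) from the 5-tuple
    (f_u, f_v, f_uu, f_vv, f_uv) = (x1, x2, x3, x4, x5)."""
    hessian_det = x3 * x4 - x5 * x5          # determinant of the surface Hessian
    metric = 1.0 + x1 * x1 + x2 * x2         # 1 + |grad f|^2
    return hessian_det / (metric * metric)

# Lower hemisphere of radius 2, f = -sqrt(4 - u^2 - v^2), evaluated at (u, v) = (1, 0):
# f_u = 1/sqrt(3), f_v = 0, f_uu = 4/(3 sqrt(3)), f_vv = 1/sqrt(3), f_uv = 0.
root3 = 3.0 ** 0.5
k_sphere = gaussian_curvature(1.0 / root3, 0.0, 4.0 / (3.0 * root3), 1.0 / root3, 0.0)
```

For the hemisphere of radius 2 the result should agree with $1/R^2 = 0.25$ up to rounding, while a saddle such as $f = uv$ at the origin gives a negative value.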

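Evaluated in each successive image is a point valued 3-tuple
image operation functional mapping the image function $I(u,v)$
of coordinates $(u,v)$ in $\mathbb{R}^2$ to the function mapping each point
(i.e. pixel) in $\mathbb{R}^2$ into the 3-tuple $(I, \partial I/\partial u, \partial I/\partial v)$. From [Wolff
1987b], the three equations that arise for the kth successive image are

$$R(x_1, x_2, a_1^{(k)}, a_2^{(k)}, \ldots, a_m^{(k)}) - I^{(k)} = 0$$

$$\frac{\partial R(x_1, x_2, a_1^{(k)}, \ldots, a_m^{(k)})}{\partial x_1}\, x_3 + \frac{\partial R(x_1, x_2, a_1^{(k)}, \ldots, a_m^{(k)})}{\partial x_2}\, x_5 - \left( \frac{\partial I}{\partial u} \right)^{(k)} = 0$$

$$\frac{\partial R(x_1, x_2, a_1^{(k)}, \ldots, a_m^{(k)})}{\partial x_1}\, x_5 + \frac{\partial R(x_1, x_2, a_1^{(k)}, \ldots, a_m^{(k)})}{\partial x_2}\, x_4 - \left( \frac{\partial I}{\partial v} \right)^{(k)} = 0 \qquad (6)$$

with respect to the reflectance function $R(x_1, x_2, a_1, a_2, \ldots, a_m)$,
which is shown to be dependent upon an arbitrary preselected set
of parameters $(a_1, a_2, \ldots, a_m)$ for the image system state, implying
the use of a variety of preselected parameter sets including those
used in the stereo methods presented in the last two examples.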
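
The following Python sketch checks the three constraints numerically on a synthetic example. It assumes that the two gradient constraints follow from the chain rule applied to $I(u,v) = R(f_u, f_v)$, uses a Lambertian reflectance with a hypothetical light source at gradient $(0.3, 0.2)$ as a stand-in for $R$, and takes a quadratic height function as the test surface. All names and numerical values are illustrative assumptions.

```python
import math

def lambertian(x1, x2, ps=0.3, qs=0.2, albedo=1.0):
    """Lambertian reflectance in gradient space; (ps, qs) is an assumed source."""
    num = 1.0 + x1 * ps + x2 * qs
    den = math.sqrt(1.0 + x1 * x1 + x2 * x2) * math.sqrt(1.0 + ps * ps + qs * qs)
    return albedo * num / den

def central_diff(func, a, b, h=1e-5):
    """Numerical partial derivatives of func(a, b) with respect to a and b."""
    da = (func(a + h, b) - func(a - h, b)) / (2.0 * h)
    db = (func(a, b + h) - func(a, b - h)) / (2.0 * h)
    return da, db

def residuals(refl, x, irradiance, grad_u, grad_v):
    """Left-hand sides of the three constraints for feature state x."""
    x1, x2, x3, x4, x5 = x
    r1, r2 = central_diff(refl, x1, x2)      # dR/dx1, dR/dx2
    return (
        refl(x1, x2) - irradiance,
        r1 * x3 + r2 * x5 - grad_u,
        r1 * x5 + r2 * x4 - grad_v,
    )

# Synthetic surface f(u, v) = a u^2 / 2 + b u v + c v^2 / 2.
a, b, c = 0.8, -0.3, 0.5

def image(u, v):
    """Image irradiance produced by the synthetic surface."""
    return lambertian(a * u + b * v, b * u + c * v)

u0, v0 = 0.4, -0.2
state = (a * u0 + b * v0, b * u0 + c * v0, a, c, b)   # (f_u, f_v, f_uu, f_vv, f_uv)
iu, iv = central_diff(image, u0, v0)                  # measured image gradient
res = residuals(lambertian, state, image(u0, v0), iu, iv)
```

At the true feature state all three residuals vanish up to finite-difference error, while a perturbed 5-tuple generally violates at least one of them.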

This is a good example of generalized stereo because it
illustrates just how complicated the sets of solution loci can get in
a higher dimensional feature space. Assuming sufficient
differentiability of the reflectance function and a complete normed 5
dimensional vector space structure on the set of 5-tuples $(x_1, x_2, \ldots, x_5)$,
each equation in the system of equations 6 determines a 4-dimensional
sub-manifold solution locus. The sub-manifold for the first
equation in this set can be mentally visualized as the "sweeping out"
of a 1 dimensional equireflectance curve embedded in the 2
dimensional subspace spanned by $x_1$ and $x_2$ (i.e. gradient space),
in the three additional dimensions spanned by $x_3$, $x_4$ and $x_5$ in
the directions of their respective axes. The last two equations in
this set can be visualized as the "sweeping out" of 2 dimensional
planes embedded in the three dimensional subspace spanned by
$x_3$, $x_4$ and $x_5$, whose coefficients are determined by the partial
derivatives of the reflectance function with respect to $x_1$ and $x_2$.
This time the "sweeping out" of the planes as $x_1$ and $x_2$ vary could
be very complicated as the partial derivatives of the reflectance
function are usually dependent on $x_1$ and $x_2$. Therefore the planes
undergo "turns" and "twists" as they are being swept out.

Even more complicated is visualizing the solution locus that
is determined simultaneously from the complete system of
equations 6. Specifying this solution locus is important for
characterizing the measurement ambiguity associated with determining the
5-tuple feature state from a single image. An important
characteristic of the solution locus at a point is its dimension, which will
be explained in more detail in section 3. Intuitively the higher
the dimension of the solution locus at a point, the more
ambiguity there is inherent to the feature state measurement. As will
be seen in section 3 the implicit function theorem provides an
efficient algorithm for determining an upper bound on the dimension
of a solution locus at a point, determined by a general set of
equations as shown in equations 1. It turns out for a Lambertian or
a Torrance-Sparrow reflectance function that equations 6 almost
always determine a 2 dimensional measurement submanifold in 5
dimensional feature space.

3 GLOBAL AND LOCAL MEASURES OF
MEASUREMENT AMBIGUITY

In section 2 generalized stereo was presented as a technique for
disambiguating the measurement of the state of a visual object
feature by adding constraints to a solution locus in feature space.
Additional constraints are obtained from the physical equations
relating the evaluation of an image operation functional to the
object feature state through physical parameters which alter the
state of the image system between successive images. Hopefully
with each successive image the "size" of the solution locus in
feature space gets reduced. But exactly what does the term "size"
mean?

It is assumed for the rest of this section that the set of feature
state n-tuples $(x_1, x_2, \ldots, x_n)$ have a Euclidean norm defined in
a natural way so that feature space is identical to $E^n$. No
generality is lost in the development to follow since $E^n$ in an abstract
sense is a manifold unto itself and any other complete normed
n-dimensional vector space structure would generate a locally
diffeomorphic manifold.

3.1 A Global Measure

Ideally a generalized stereo technique tries to achieve a unique
point state measurement of an object feature. This may be
possible using some preselected sets of physical imaging parameters
and not with others. If the solution locus is not unique the
measurement ambiguity may consist of a discrete number of points.
In worse cases the solution locus may be a surface or the union of
many disconnected component surfaces of many dimensions.


In the case of a discrete number of points, the cardinality
of this point set is clearly a good characterization of the "size"
of the resulting measurement ambiguity. In the case of a
solution locus comprised of one or many multidimensional surfaces,
the integration of infinitesimal volume elements$^4$ of appropriate
dimensionality over the entire extent of the solution locus
characterizes the "size". Using this definition of "size", a solution has
finite "size" if and only if the solution locus is compact since
solution loci are closed subsets of $E^n$. Assume that either by trial and
error or by some more intelligent algorithm the "size" of the
solution locus resulting from a sequence of successive images has
been minimized, hopefully to some finite value corresponding to
a compact region of feature space. Call this a preliminary global
bounding.

3.2  A  Local  Measure:  The  Dimension  Of  Measure¬ 
ment  Ambiguity  At  A  Point 

After a preliminary global bounding has taken place, the next step
in reducing measurement ambiguity is to reduce the dimension at
all points of the resulting solution locus within the preliminary
global bounding region. The dimensionality of the surface in the
local vicinity of a point on the solution locus is considered to be a
local measure of the "size" of the measurement ambiguity at that
point. This subsection is concerned with evaluating the dimension
of solution loci at individual points as a characterization of the
"size" of measurement ambiguity.

To define the dimension of a solution locus at a point in
feature space the concept of a $C^{(q)}$ manifold of dimension r in
n-dimensional Euclidean space $E^n$ is introduced.

DEFINITION: A nonempty set $M \subset E^n$ is a $C^{(q)}$ manifold of
dimension r, $1 \le r < n$, if M has the property that for every
$x_0 \in M$ there exists an open set $U \subset E^n$ containing $x_0$ and an
open set $V \subset E^r$ such that

(i) there exists a one-to-one function $\phi(y) = (\phi_1, \phi_2, \ldots, \phi_n)$ from
V onto $U \cap M$ which is $C^{(q)}$ differentiable, meaning that the
function produced by $q \ge 1$ differentiations w.r.t. any of $y_1, y_2, \ldots, y_r$ for
each $\phi_i$ exists and is continuous.

(ii) for all $y \in V$ the Jacobian $J\phi(y)$ is one-to-one.

A $C^{(q)}$ manifold of dimension r will simply be referred to as
an r-manifold. Intuitively from a topological point of view an
r-manifold is locally indistinguishable from $E^r$ in a sufficiently small
region about any point. Condition (i) induces a local
coordinatization in an open neighborhood about any point on the r-manifold
M with respect to the r-dimensional coordinates of $E^r$. Condition
(ii) assures the existence of an r-dimensional tangent space at
every point on M. It is clear from condition (ii) that the determinant
of $J\phi(y)$ is nonzero for all $y \in V$.

The dimension of measurement ambiguity at a point x on a
solution locus M in $E^n$ is an integer r, such that $1 \le r < n$, if
there exists an open set $W \subset E^n$ containing x such that $W \cap M$
is an r-manifold in $E^n$. Indeed this implies that the dimension of
measurement  ambiguity  may  be  undefined  at  some  points.  The 
dimension  of  measurement  ambiguity  for  an  isolated  point  on  a 
solution  locus  is  defined  to  be  zero. 

In $E^3$, for instance, the dimension of measurement ambiguity
for a point satisfying the equation of a plane or a sphere is 2 as the
respective solution loci are 2-manifolds. Consider the equation

$$(x_1)^2 + (x_2)^2 - (x_3)^2 = 0$$

with a solution locus of two cones that meet head on. The
dimension of measurement ambiguity at the point of contact is
undefined but is of dimension 2 for any other point on the solution
locus. A meaningful definition for the dimension of measurement
ambiguity at such singularity points is saved for future work.

The implicit function theorem can be used to determine the
dimension of measurement ambiguity at a point on a solution
locus determined by the general system of equations 1 in section 2.
Assuming that each of the $\Phi_i$ is of class $C^{(q)}$, the system of
equations 1 determines a class $C^{(q)}$ function $\Phi = (\Phi_1, \Phi_2, \ldots, \Phi_{n-r})$
from $E^n$ into $E^{n-r}$ where $k = n - r$, $1 \le r < n$. The solution
locus is therefore represented by $M = \{x \in E^n : \Phi(x) = 0\}$. The
implicit function theorem effectively states that if the Jacobian

$$J\Phi = \begin{pmatrix}
\partial\Phi_1/\partial x_1 & \partial\Phi_1/\partial x_2 & \cdots & \partial\Phi_1/\partial x_n \\
\partial\Phi_2/\partial x_1 & \partial\Phi_2/\partial x_2 & \cdots & \partial\Phi_2/\partial x_n \\
\vdots & \vdots & & \vdots \\
\partial\Phi_{n-r}/\partial x_1 & \partial\Phi_{n-r}/\partial x_2 & \cdots & \partial\Phi_{n-r}/\partial x_n
\end{pmatrix}$$

evaluated at a solution point $x_0 \in M$ has rank (n-r) then there
exists an open set $W \subset E^n$ containing $x_0$ such that $W \cap M$ is an
r-manifold. That is, the dimension of the measurement ambiguity
at $x_0$ is r.


Without loss of generality arrange the functions $\Phi_i$ so that
the last (n-r) column vectors in $J\Phi$ are linearly independent and
form the (n-r)x(n-r) matrix $\hat{J}\Phi$ out of these vectors. Clearly the
determinant $|\hat{J}\Phi| \ne 0$. Let

$$\hat{x} = (x_1, x_2, \ldots, x_r), \qquad \hat{x}_0 = ((x_0)_1, (x_0)_2, \ldots, (x_0)_r)$$

denote the vectors obtained by taking the first r components of x and $x_0$.
The following is the formal statement of the implicit function
theorem.

IMPLICIT FUNCTION THEOREM: Let $\Phi$ be of class $C^{(q)}$
from an open set $D \subset E^n$ into $E^{n-r}$, where $q \ge 1$ and $1 \le r < n$.
Let $x_0 \in D$ be such that $\Phi(x_0) = 0$ and $|\hat{J}\Phi(x_0)| \ne 0$. Then there
exists an open set $U \subset E^n$ containing $x_0$, an open set $V \subset E^r$
containing $\hat{x}_0$, and $\phi = (\phi_1, \phi_2, \ldots, \phi_{n-r})$ of class $C^{(q)}$ on V such
that

$$|\hat{J}\Phi(x)| \ne 0 \text{ for all } x \in U$$

and

$$\{x \in U : \Phi(x) = 0\} = \{(\hat{x}, \phi(\hat{x})) : \hat{x} \in V\}.$$

Effectively $(\hat{x}, \phi(\hat{x}))$ provides a local coordinatization at the
point $x_0$ on the solution locus $M = \{x \in U : \Phi(x) = 0\}$ with
respect to $E^r$, and $|\hat{J}\Phi(x)| \ne 0$ implies condition (ii) of the definition
of an r-manifold made above. A proof of a slightly more general
version of the implicit function theorem can be found in [Loomis
and Sternberg 1975].

It is important to note that while the condition on the rank of
the Jacobian $J\Phi$ being (n-r) at a point $x_0$ satisfying equations 1 is
a sufficient condition for there to exist a neighborhood about $x_0$
being an r-manifold, it is not a necessary condition. It is possible
for the rank of $J\Phi$ to be less than (n-r) at $x_0$ and still have a
neighborhood of $x_0$ being an r-manifold. If the condition on the
rank of the Jacobian is violated, further analysis of the higher
order derivatives of the functions $\Phi_i$ with respect to feature space
coordinates is required. This will be reserved for future work and
will not be presented here.

Even though the Jacobian for a set of class $C^{(1)}$ equations
may not be of sufficient rank for the entire set, the implicit
function theorem is still useful for establishing upper bounds on the
dimension of measurement ambiguity at a point on the solution
locus. In fact if the actual rank of the Jacobian $J\Phi$ is w in the
feature space $E^n$, an upper bound on the dimension
of measurement ambiguity is clearly (n-w). From a large number
of successive images it is possible to get a set of equations which
number far greater than n and yet have their respective Jacobian
with rank w < n. This may motivate a search for a different set
of preselected physical parameters to alter the state of the
imaging system so that the resulting equations from successive images
have a smaller upper bound on the dimension of measurement
ambiguity at certain points. Reduction of even a single dimension
of ambiguity results in a far superior generalized stereo method.
Clearly a necessary condition for a unique n-tuple feature state
point measurement by a given generalized stereo technique is that
the dimensionality be zero at that point if in fact the dimension
of measurement ambiguity is defined there.
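
The rank test described above can be sketched numerically. The Python code below estimates the Jacobian of a system of constraint functions by central differences, computes its rank by Gaussian elimination with a tolerance, and reports n minus the rank as an upper bound on the dimension of measurement ambiguity. The helper names, the tolerance, and the example constraints are illustrative assumptions.

```python
def jacobian(funcs, x, h=1e-6):
    """Central-difference Jacobian of the constraint functions at the point x."""
    rows = []
    for f in funcs:
        row = []
        for j in range(len(x)):
            xp, xm = list(x), list(x)
            xp[j] += h
            xm[j] -= h
            row.append((f(xp) - f(xm)) / (2.0 * h))
        rows.append(row)
    return rows

def rank(matrix, tol=1e-6):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    r = 0
    for col in range(n_cols):
        pivot = max(range(r, n_rows), key=lambda i: abs(m[i][col]), default=None)
        if pivot is None or abs(m[pivot][col]) <= tol:
            continue                      # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, n_rows):
            factor = m[i][col] / m[r][col]
            for k in range(col, n_cols):
                m[i][k] -= factor * m[r][k]
        r += 1
        if r == n_rows:
            break
    return r

def ambiguity_upper_bound(funcs, x):
    """Upper bound n - w on the dimension of measurement ambiguity at x."""
    return len(x) - rank(jacobian(funcs, x))

def sphere(x):
    return x[0] ** 2 + x[1] ** 2 + x[2] ** 2 - 1.0

def shifted_sphere(x):
    return (x[0] - 1.0) ** 2 + x[1] ** 2 + x[2] ** 2 - 1.0

def scaled_sphere(x):
    return 2.0 * sphere(x)               # redundant copy of the first constraint

def cone(x):
    return x[0] ** 2 + x[1] ** 2 - x[2] ** 2

on_circle = (0.5, 3.0 ** 0.5 / 2.0, 0.0)  # lies on both spheres
```

With these definitions, a single sphere gives a bound of 2, two intersecting spheres give a bound of 1 at a common point, and adding a redundant constraint leaves the bound unchanged, which mirrors the remark that many equations need not raise the rank. At the apex of the double cone the Jacobian vanishes, so the bound degenerates to n = 3 and the theorem gives no further information there.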

4 MEASUREMENT AMBIGUITY AND
SYMMETRY 

In  section  2  generalized  stereo  was  presented  at  a  level  of  abstrac¬ 
tion  which  unifies  existing  conventional  stereo  techniques  such  as 
parallax  and  photometric  stereo  and  also  which  serves  as  a  general 
paradigm  for  developing  a  broader  variety  of  stereo  techniques.  As 
the  key  goal  of  any  generalized  stereo  technique  is  to  optimally 
disambiguate  the  physical  measurement  of  a  given  visual  object 
feature  this  new  general  formulation  of  stereo  vision  poses  some 
serious  mathematical  challenges  aimed  at  characterizing  the  con¬ 
ditions  under  which  disambiguation  is  feasible.  This  section  takes 
a formal look at the role that the symmetry of solution loci plays
in  characterizing  these  conditions  for  measurement  disambigua¬ 
tion.  Inspired  by  the  Erlangen  programm  of  Felix  Klein  begun  by 
his  famous  address  at  Erlangen  in  1872,  generalized  stereo  will  be 
formulated  at  a  higher  level  of  abstraction  still,  using  the  notion 
of  the  action  of  a  group  of  symmetries  on  the  n-tuple  point  field 
of  feature  space. 

4.1  Background 

The main tool used by Klein's Erlangen programm, to describe
any type of geometry specified by a set of constraining axioms
on a point field, is the group of automorphisms (i.e. one-to-one
correspondences) of the point field into itself that preserves the
validity of all the constraining axioms. It is understood that a
point field such as a plane exists independent of the way it is
coordinatized. A coordinatization of a point field is a one-to-one
correspondence between the point field and a set of coordinate
symbols

$$p \rightarrow x \quad \text{or} \quad x = x(p).$$

In the case of n-dimensional Euclidean point space the set of
coordinate symbols are the n-tuples $(x_1, x_2, \ldots, x_n)$. Given an
automorphism of the point field

$$p \rightarrow p' = \sigma p$$

a new equivalent coordinatization is produced

$$x'(p) = x(p') = x(\sigma p).$$

Therefore the automorphism $\sigma$ is described by the transformation
S of coordinates from x to x'

$$x' = S(x) \qquad \{x = x(p),\ x' = x(p')\}.$$

The correspondence $\sigma \rightarrow S$ is called a realization of an abstract
group $\Gamma$ of automorphisms of a point field if the correspondence
satisfies

$$I \rightarrow E, \qquad \sigma\tau \rightarrow ST \qquad (7)$$

where I is the identity automorphism and E is the identity
coordinate transformation and T is a coordinate transformation
corresponding to the automorphism $\tau$. A correspondence between
any two groups satisfying the properties shown in 7 is said to be
a homomorphism. Hence a realization of a group $\Gamma$ is a
homomorphism of $\Gamma$ into the group of coordinate transformations. In
particular, if the realization is defined by an injective (i.e.
one-to-one) homomorphism, then the realization is said to be faithful.
The abstract group $\Gamma$ can be considered as an entity unto itself.
For a good introduction to group theory consult [Jacobson 1974]
or [Herstein 1975]. The correspondence between automorphisms
and coordinate transformations satisfied by the relations in 7 is
also said to define an action of the group $\Gamma$ on the point field.

The abstract automorphism group that leaves the axioms of
n dimensional Euclidean geometry invariant, for example, is the
normalizer subgroup of the group of proper orthogonal linear
transformations, $O^+(n)$, as a subgroup of the full linear group
GL(n) of invertible n x n matrices. The normalizer subgroup of a
subgroup G' of a group G is defined to be the subset of G

$$\{g \in G : g^{-1} h g \in G' \text{ for all } h \in G'\}.$$

It is clear that this subset forms a subgroup of G which at least
includes G'. The normalizer subgroup of $O^+(n)$ also includes
dilations (i.e. uniform change of scale) and improper orthogonal
rotations (i.e. 'reflections'). It is intuitively clear that the
congruence of figures in $E^n$ constructed from n-vectors is preserved
under the action of this subgroup of linear transformations. For
a more in depth discussion of the group theoretic invariance of
Euclidean geometry and other invariants of groups consult [Weyl
1946].
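
The normalizer condition can be spot-checked numerically in the plane. The Python sketch below conjugates a few sample rotations in $O^+(2)$ by a candidate matrix g and tests whether the result is again a proper orthogonal matrix. The helper names and the sample angles are illustrative assumptions, and a check on finitely many rotations is only evidence for the condition, not a proof of it.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inverse(a):
    """Inverse of an invertible 2x2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def rotation(theta):
    """Proper orthogonal rotation by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def is_proper_orthogonal(m, tol=1e-9):
    """True if m^T m = I and det m = +1 within the tolerance."""
    transpose = [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]
    gram = matmul(transpose, m)
    identity = [[1.0, 0.0], [0.0, 1.0]]
    orthogonal = all(abs(gram[i][j] - identity[i][j]) < tol
                     for i in range(2) for j in range(2))
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return orthogonal and abs(det - 1.0) < tol

def normalizes_rotations(g, angles=(0.3, 1.0, 2.5)):
    """Spot-check g^{-1} h g in O+(2) for a few sample rotations h."""
    g_inv = inverse(g)
    return all(is_proper_orthogonal(matmul(matmul(g_inv, rotation(t)), g))
               for t in angles)

dilation = [[3.0, 0.0], [0.0, 3.0]]
reflection = [[1.0, 0.0], [0.0, -1.0]]
shear = [[1.0, 1.0], [0.0, 1.0]]
```

A uniform dilation and a reflection both pass the check, in line with the statement above, whereas a shear does not, since conjugating a rotation by a shear does not yield an orthogonal matrix.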

The  goal  of  the  Erlangen  programm  was  to  solve  any  problem 
in  a  particular  geometry  purely  in  terms  of  the  group  theoretic 
properties  of  the  transformations  (i.e.  automorphisms)  that  pre¬ 
serve  its  objective  nature  by  leaving  its  defining  axioms  invariant. 
With  a  similar  philosophy  in  mind  the  problem  of  characterizing 
when  an  ambiguous  solution  locus  can  be  disambiguated  is  formu¬ 
lated  purely  in  terms  of  the  interaction  of  the  group  of  automor¬ 
phisms  acting  on  a  feature  space  point  field  that  leaves  invariant 
the  ambiguous  measurement  solution  locus  for  an  object  feature, 
with  the  set  of  stereo  automorphisms  induced  on  feature  space 
by  image  system  state  transitions.  The  theoretical  treatment  is 
very  general  but  has  concrete  applications  in  supplying  intuition 
towards  selecting  image  state  transitions  that  disambiguate  well 
defined  symmetries  possessed  by  a  solution  locus.  Indeed  any 
kind  of  symmetry  inherent  to  a  measurement  solution  locus  is  a 
representation  of  ambiguity. 

The automorphisms leaving invariant an ambiguous solution
locus for an object feature from a single image preserve the
physical reality of the measurement with respect to the physical state
of the imaging system and the evaluation of the image operation
functional that is used. For example, consider the equireflectance
measurement locus of local surface orientation in a feature space
point field coordinatized with gradient space coordinate symbols
as in figure 2. With respect to the observed image irradiance and
the given state of the imaging system in terms of the imaging
geometry etc., any point 2-tuple state $(p, q)$ lying on the
equireflectance locus can be replaced by any other point 2-tuple state
$(p', q')$ also lying on the locus. Measurement ambiguity implies a
virtual indistinguishability between distinct measurement states
on a solution locus which can be represented by the symmetries
of one-to-one correspondences of the feature space point field into
itself leaving the solution locus invariant. Generalized stereo
techniques attempt to generate a series of solution loci produced from
image state transitions that mutually do not possess the same
symmetries. It is clear that for zero measurement error, which is
assumed here, no matter how many solution loci are generated,
any pair of loci will have at least one point in common.

Figure 2

The  set  of  automorphisms  leaving  a  solution  locus  invariant 
precisely  describe  the  physical  symmetries  that  are  inherent  to  the 
measurement of an object feature with respect to a given imaging
system model. Figure 1 from [Wolff 1987a] illustrates the
measurement  of  surface  orientation  using  the  Torrance-Sparrow 
reflectance  model  by  varying  only  the  incident  linear  polarization 
of  the  light  source.  Regardless  of  the  number  of  successive  im¬ 
ages  taken  corresponding  to  different  incident  linear  polarizations 
varying  between  parallel  and  perpendicular  orientation  with  re¬ 
spect  to  the  plane  of  incidence,  the  equireflectance  measurement 
curves  will  pass  through  the  exact  same  two  measurement  point 
states.  The  physical  reason  for  this  results  from  the  invariance 
of  observed  image  irradiance  when  the  angle  between  the  local 
orientation  vector  and  the  plane  of  incidence  determined  by  the 
light  source  and  viewing  vectors,  is  constant  regardless  of  which 
side  of  the  plane  of  incidence  the  orientation  vector  lies.  This 
physical  symmetry  manifests  itself  as  the  invariance  of  the  equire¬ 
flectance  solution  locus  under  the  automorphism  represented  as 
the improper orthogonal rotation
in gradient space coordinates. It is quite evident that the variation
of the physical parameter representing incident linear polarization
in the image irradiance equation does not alter this inherent
symmetry. Therefore it is impossible to resolve between two distinct
but indistinguishable measurement points that are transformed
to each other by 'reflection'.
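
The reflection symmetry can be illustrated with a small numerical experiment. The Python sketch below reflects a gradient space point about the line through the origin and the light source gradient $(p_s, q_s)$, which is assumed here to be the image of the plane of incidence for a viewer at the gradient space origin, and confirms that a toy reflectance is unchanged for every value of the polarization parameter $a_1$. The reflectance is a deliberately simple stand-in with arbitrary constants, not the Torrance-Sparrow model, and the reflection matrix is the standard one for that line rather than a form taken from the text.

```python
import math

def unit(v):
    """Normalize a 3-vector."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reflect(p, q, ps, qs):
    """Reflect (p, q) about the line through the origin and (ps, qs).
    Requires (ps, qs) != (0, 0). The matrix has determinant -1."""
    d = ps * ps + qs * qs
    c = (ps * ps - qs * qs) / d      # cos of twice the line angle
    s = 2.0 * ps * qs / d            # sin of twice the line angle
    return (c * p + s * q, s * p - c * q)

def toy_reflectance(p, q, ps, qs, a1, f_perp=0.9, f_par=0.4):
    """Toy isotropic reflectance: polarization-weighted specular lobe plus a
    diffuse term. f_perp and f_par are arbitrary illustrative constants."""
    n = unit((p, q, -1.0))                     # surface normal direction
    s = unit((ps, qs, -1.0))                   # light source direction
    v = (0.0, 0.0, -1.0)                       # viewer direction
    h = unit(tuple(si + vi for si, vi in zip(s, v)))   # half vector
    fresnel = a1 * f_perp + (1.0 - a1) * f_par
    specular = max(dot(n, h), 0.0) ** 20
    diffuse = max(dot(n, s), 0.0)
    return fresnel * specular + diffuse

ps, qs = 0.6, 0.3
p, q = 0.2, -0.7
p_ref, q_ref = reflect(p, q, ps, qs)
differences = [
    abs(toy_reflectance(p, q, ps, qs, a1) - toy_reflectance(p_ref, q_ref, ps, qs, a1))
    for a1 in (0.0, 0.25, 0.5, 1.0)
]
```

Because the reflection preserves the angles between the surface normal and both the source and viewing directions, the two orientations produce the same value for every polarization setting, so varying $a_1$ alone cannot separate them.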

4.2  The  Symmetry  Group  For  A  Solution  Locus 

The definition of a symmetry element for a solution locus $M_0$
is now formalized. Feature space of $n$ dimensions will be
represented by $E^n$ and, unless otherwise specified, should be
interpreted as only a point field. An automorphism of $E^n$ leaving
the subset $M_0$ invariant is said to stabilize $M_0$. Denote the set
of all automorphisms of $E^n$ by $A_{E^n}$ and the set of all
automorphisms stabilizing $M_0$ by $A_{M_0}$. Clearly $A_{M_0}$ forms
a subgroup of $A_{E^n}$.

Definition 1 A symmetry element with respect to a nonempty
subset $M_0$ of $E^n$ is an equivalence class of automorphisms that
stabilize $M_0$, defined by the relation that two automorphisms are
equivalent if and only if they have exactly the same action when
restricted to $M_0$.

The entire set of symmetry elements of $M_0$, denoted by $S_{M_0}$,
forms a group. Letting $e_{M_0}$ denote the equivalence class of
automorphisms in $A_{M_0}$ which are the identity action on $M_0$, it
can be easily shown that $e_{M_0}$ forms a normal subgroup of
$A_{M_0}$ and that

$$\phi_{can} : A_{M_0}/e_{M_0} \rightarrow A_{M_0}$$

defined by mapping a coset realizing an element of $S_{M_0}$ into the
automorphism in $A_{M_0}$ with the same action on $M_0$ and which is
the identity action on the complement of $M_0$ in $E^n$, is an
injective homomorphism of $S_{M_0}$ into $A_{M_0}$, thus making
$S_{M_0}$ a subgroup of $A_{M_0}$. Since to each point in the feature
space point field there is assigned a unique coordinate symbol, the
injective homomorphism $\phi_{can}$ induces a faithful realization of
$S_{M_0}$ in terms of a set of coordinate transformations on feature
space. The realization $\phi_{can}$ will be referred to as the
canonical realization of $S_{M_0}$ in $A_{E^n}$.
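
The definitions above can be made concrete on a toy example. The following is a minimal sketch, not part of the original development: it assumes a finite "feature space" of `n` points and treats every permutation of those points as an automorphism, which is far more permissive than the coordinate transformations considered in the text. The names `stabilizer` and `symmetry_elements` are hypothetical. The sketch enumerates the stabilizer of a subset, groups it into symmetry elements (classes with the same restricted action), and checks that the number of symmetry elements equals the stabilizer order divided by the order of the subgroup acting as the identity on the subset.

```python
from itertools import permutations

def stabilizer(n, subset):
    """All permutations p of range(n), with p[x] the image of x,
    that map `subset` onto itself (assumption: every permutation
    counts as an automorphism of the finite feature space)."""
    s = frozenset(subset)
    return [p for p in permutations(range(n))
            if frozenset(p[x] for x in s) == s]

def symmetry_elements(n, subset):
    """Group the stabilizer into classes of permutations that have
    exactly the same action when restricted to `subset`."""
    pts = sorted(subset)
    classes = {}
    for p in stabilizer(n, subset):
        key = tuple(p[x] for x in pts)  # restricted action on the subset
        classes.setdefault(key, []).append(p)
    return classes

n, M0 = 4, {0, 1}
A_M0 = stabilizer(n, M0)               # automorphisms stabilizing M0
S_M0 = symmetry_elements(n, M0)        # one class per symmetry element
e_M0 = S_M0[tuple(sorted(M0))]         # class acting as the identity on M0

print(len(A_M0), len(S_M0), len(e_M0))
```

In this toy setting the class acting as the identity on the subset plays the role of the normal subgroup, and each remaining class is one of its cosets.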

A natural generalization to the definition of symmetry elements
which are mutual to a finite collection of nonempty pairwise
intersecting subsets of $E^n$ is as follows:

Definition 2 A mutual symmetry element with respect to nonempty
subsets $M_0, M_1, \ldots, M_l$ of $E^n$, where
$M_i \cap M_j \neq \emptyset$ for all $i < j$,
is an equivalence class of automorphisms of $E^n$ stabilizing each
$M_0, M_1, \ldots, M_l$ individually, defined by the relation that two
automorphisms are equivalent if and only if the action of these two
automorphisms is exactly the same on $\bigcap_{i=0}^{l} M_i$.

Denote the set of automorphisms of $E^n$ stabilizing
$M_0, M_1, \ldots, M_l$ individually by $A_{M_0,M_1,\ldots,M_l}$. It is
straightforward to demonstrate that $A_{M_0,M_1,\ldots,M_l}$ forms a
subgroup of $A_{E^n}$. The following lemma establishes the group
theoretic equivalence between the set of mutual symmetry elements
with respect to $M_0, M_1, \ldots, M_l$ and the symmetry group
$S_{\bigcap_{i=0}^{l} M_i}$ up to isomorphism.


Lemma 1 Let $P$ be the set of mutual symmetry elements with
respect to nonempty pairwise intersecting subsets
$M_0, M_1, \ldots, M_l$ of $E^n$. Then $P$ forms a group isomorphic to
$S_{\bigcap_{i=0}^{l} M_i}$.

Proof: It is first demonstrated that any automorphism of
$A_{M_0,M_1,\ldots,M_l}$ must in turn stabilize
$\bigcap_{i=0}^{l} M_i$ and therefore be a member of
$A_{\bigcap_{i=0}^{l} M_i}$. It is then demonstrated that the
equivalence classes of mutual symmetry elements of
$A_{M_0,M_1,\ldots,M_l}$ are simply restrictions of the equivalence
classes of symmetry elements of $A_{\bigcap_{i=0}^{l} M_i}$ to
automorphisms with the additional constraint that they stabilize
$M_0, M_1, \ldots, M_l$ individually. This restriction mapping
provides an isomorphism.

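Let $p_a$ be an automorphism in $A_{M_0,M_1,\ldots,M_l}$ and let
$p_a(M_i)$ denote the range of the action of $p_a$ on the subset
$M_i$. Thus $p_a(M_i) = M_i$ for all $0 \le i \le l$. Since
$p_a(\bigcap_{i=0}^{l} M_i) \subseteq M_i$ for all $0 \le i \le l$,
clearly $p_a(\bigcap_{i=0}^{l} M_i) \subseteq \bigcap_{i=0}^{l} M_i$.
To show that $p_a(\bigcap_{i=0}^{l} M_i) = \bigcap_{i=0}^{l} M_i$,
suppose there exists a point $x \notin \bigcap_{i=0}^{l} M_i$ such
that $p_a(x) \in \bigcap_{i=0}^{l} M_i$. Since
$A_{M_0,M_1,\ldots,M_l}$ is a subgroup,
$p_a^{-1}(\bigcap_{i=0}^{l} M_i) \subseteq \bigcap_{i=0}^{l} M_i$,
which contradicts the fact that
$p_a^{-1}(p_a(x)) \notin \bigcap_{i=0}^{l} M_i$. Therefore, since
$p_a$ is a one-to-one correspondence,
$p_a(\bigcap_{i=0}^{l} M_i) = \bigcap_{i=0}^{l} M_i$.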

Since each element of $A_{M_0,M_1,\ldots,M_l}$ stabilizes
$\bigcap_{i=0}^{l} M_i$, it is clear that a mutual symmetry element
equivalence class with respect to $M_0, M_1, \ldots, M_l$ is a
restriction of the equivalence class of a symmetry element of
$\bigcap_{i=0}^{l} M_i$ having exactly the same action on
$\bigcap_{i=0}^{l} M_i$, being further constrained by having to
stabilize $M_0, M_1, \ldots, M_l$ individually. It is clear that this
restriction mapping is one-to-one and onto. The fact that
$A_{M_0,M_1,\ldots,M_l}$ is a subgroup and therefore has the closure
property implies that the restriction mapping is a homomorphism.

Q.E.D. 
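
Lemma 1 can be checked by brute force on a small finite example. This is only an illustrative sketch under a strong assumption: the feature space is a set of five points and every permutation is admitted as an automorphism. The helper `image` and the particular subsets are hypothetical choices. The code verifies that every permutation stabilizing each subset also stabilizes the intersection, and that the restricted actions of the mutual symmetry elements coincide with those of the symmetry elements of the intersection.

```python
from itertools import permutations

def image(p, subset):
    """Image of `subset` under the permutation p (p[x] is the image of x)."""
    return frozenset(p[x] for x in subset)

n = 5
M = [frozenset({0, 1, 2}), frozenset({1, 2, 3})]   # pairwise intersecting loci
common = frozenset.intersection(*M)                 # the intersection {1, 2}
pts = sorted(common)

# Automorphisms stabilizing every M_i individually.
joint = [p for p in permutations(range(n))
         if all(image(p, Mi) == Mi for Mi in M)]

# First step of the proof: each of them also stabilizes the intersection.
stabilizes_common = all(image(p, common) == common for p in joint)

# Mutual symmetry elements, identified by their action on the intersection.
mutual = {tuple(p[x] for x in pts) for p in joint}

# Symmetry elements of the intersection itself.
full = {tuple(p[x] for x in pts)
        for p in permutations(range(n)) if image(p, common) == common}

print(stabilizes_common, mutual == full, len(mutual))
```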
4.3  Stereo Automorphisms And The Preserved Symmetry Subgroup

Stereo  vision  is  a  dynamic  process  in  the  sense  that  the  physical 
state  of  the  imaging  system  is  always  altered  between  successive 
images.  Section  2  showed  this  as  the  transition  from  one  m-tuple 
state  to  another  m-tuple  state 

$$(a_1, a_2, \ldots, a_m) \rightarrow (a'_1, a'_2, \ldots, a'_m).$$

An  equivalent  way  of  representing  this  state  transition  with  re¬ 
spect  to  its  effect  on  the  measurement  solution  locus  is  in  terms 
of  a  corresponding  one-to-one,  onto  coordinate  transformation  in 
feature  space 

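$$
\begin{aligned}
x_1 &\rightarrow x'_1(x_1, x_2, \ldots, x_n) \\
x_2 &\rightarrow x'_2(x_1, x_2, \ldots, x_n) \\
    &\;\;\vdots \\
x_n &\rightarrow x'_n(x_1, x_2, \ldots, x_n)
\end{aligned}
\qquad (8)
$$

where the coordinate transformation functions, $x'_i$, satisfy the
constraint that $(x_1, x_2, \ldots, x_n)$ solves the system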

$i(xi,X2, ...  ,x„,aj,a2, ...  ,a'm,  F1'(/,)Ia)  =  0 

$2(xuX2,...,xn,a'l,a'2,...,a'tn,FI(I,)\A)  =  0 

if  and  only  if  it  solves  the  system  (*))  . 

These systems of equations correspond to the same equations (1)
in section 2 except without the superscript $(k)$ for ease of
reading. The coordinate transformation (8) therefore describes the
transformation of the measurement solution locus $M$ when the
imaging system is in state $(a_1, a_2, \ldots, a_m)$ to the
measurement solution locus $M'$ when the imaging system state is
changed to $(a'_1, a'_2, \ldots, a'_m)$. While many coordinate
transformations depicted in (8) may satisfy the above constraint, all
such transformations are equivalent in the sense that the range of
any coordinate transformation when restricted to $M$ is $M'$. This
motivates the following definition:

Definition 3 Given a generalized stereo technique, let $M$ and $M'$
be two solution loci for an object feature state in $E^n$ with respect
to two physical states of the imaging system. Then the
automorphisms in $A_{E^n}$ whose range when restricted to $M$ is $M'$
are termed stereo automorphisms. Two stereo automorphisms are
equivalent if, when restricted to a solution locus $M$, they have
exactly the same range.

For a given generalized stereo technique the set of all stereo
automorphisms is generated by all possible physical state
transitions of the imaging system. The set of all stereo
automorphisms does not necessarily form a group. The only group
property which this set may not satisfy is closure. A stereo
automorphism bringing solution locus $M_i$ into solution locus $M_j$
will be denoted by $t_{ij}$ and its action is shown as
$t_{ij}(M_i) = M_j$. The process of
disambiguation for a generalized stereo vision technique can be
characterized by the algebraic interaction of the set of stereo
automorphisms with the symmetry group $S_M$ for the measurement
solution locus $M$ obtained from any successive image.

Recall from section 2 that the solution locus $M_0$ is defined to
be disambiguated by $l + 1$ successive images (i.e. $l$ image state
transitions) if and only if

$$\bigcap_{i=0}^{l} M_i \subsetneq M_0 .$$

In the same vein, the solution locus $M_0$ is equivalently defined to
be disambiguated by stereo automorphisms
$\{t_{01}, t_{02}, \ldots, t_{0l}\}$. Before making another key
definition an important lemma is proved.

Lemma 2 There exists an injective homomorphism of
$S_{\bigcap_{i=0}^{l} M_i}$ into $S_{M_0}$, thus making
$S_{\bigcap_{i=0}^{l} M_i}$ a subgroup of $S_{M_0}$. The solution
locus $M_0$ is disambiguated by stereo automorphisms
$\{t_{01}, t_{02}, \ldots, t_{0l}\}$ if and only if
$S_{\bigcap_{i=0}^{l} M_i}$ is a proper subgroup of $S_{M_0}$.

Proof: Construct an injective homomorphism using the coset
realizations of $S_{\bigcap_{i=0}^{l} M_i}$ and $S_{M_0}$ given by
their respective coset isomorphisms. Define the map $\phi_r$ by
assigning a coset of feature space automorphisms representing an
element of $S_{\bigcap_{i=0}^{l} M_i}$ to the element of $S_{M_0}$
represented by the coset of feature space automorphisms having the
equivalent action on the subset $\bigcap_{i=0}^{l} M_i$ and being the
identity mapping on the difference subset
$M_0 - \bigcap_{i=0}^{l} M_i$. It is clear that this mapping is an
injective homomorphism. The injective homomorphism $\phi_r$ results
from the following commutative diagram

An‘im0M,/er\'M, - —~AM0/eMa 

t  4>r  I 


The range of this injective homomorphism is a proper subgroup
of $S_{M_0}$ if and only if $M_0 - \bigcap_{i=0}^{l} M_i$ is
non-empty or, equivalently, $\bigcap_{i=0}^{l} M_i \subsetneq M_0$.

Q.E.D. 

Lemmata  1  and  2  together  imply  the  following  commutative 
diagram 


15  "!.<,«• - 


Injective homomorphism $\phi_P$ is defined as the composition of
the isomorphism given by the restriction mapping from lemma 1
with the injective homomorphism $\phi_r$ defined in lemma 2.


$l(*5(Xl,X2,.  ••  ,*n),xi(Xl,X2,..  -  •-»*»(*!,  *2 . X„),<7|,fl2, 

<J>2(x((x,,X2 . Xn),  X2(x,  ,  X2,  .  .  .  ,  Xn), - X^(x,,X2. - 

-V*;  (X1.X2 . xn),x2(xi , x2 . x„),,  ..x;(x,,x2 . x„).n,,n2. 


F,(  /,)!,,)  =  0 
■■-«m./-<(f«)U>  -=  0 


Definition 4 The set of preserved symmetry elements of the symmetry
group $S_{M_0}$ for measurement solution locus $M_0$ with respect to
stereo automorphisms $\{t_{01}, t_{02}, \ldots, t_{0l}\}$ is the range
of the injective homomorphism $\phi_P$ from $P$ into $S_{M_0}$ defined
in the commutative diagram immediately above.

Clearly the set of preserved symmetry elements defined above
forms a subgroup of $S_{M_0}$ isomorphic to
$S_{\bigcap_{i=0}^{l} M_i}$. The canonical realization of the set of
preserved symmetry elements with respect to stereo automorphisms
$\{t_{01}, t_{02}, \ldots, t_{0l}\}$ in $A_{E^n}$ is the restriction
of $\phi_{can}$ to the homomorphic image of $P$ in $S_{M_0}$ with
respect to $\phi_P$.

4.4  Disambiguation Theorems

The tools have been developed to help prove some precise statements
about the conditions under which measurement disambiguation takes
place.

Theorem 1 An element $s$ of the symmetry group $S_{M_0}$ for solution
locus $M_0$ with canonical realization $p$ is a preserved symmetry
element with respect to stereo automorphisms
$\{t_{01}, t_{02}, \ldots, t_{0l}\}$ if and only if
$t_{0i}^{-1} p\, t_{0i}$ is a canonical realization for some element
of $S_{M_0}$ for all $1 \le i \le l$.

Proof: Suppose that $s$ is a preserved symmetry element of
symmetry group $S_{M_0}$ with canonical realization $p$. From the
definition of $\phi_r$ in lemma 2, $p \in A_{E^n}$ is a canonical
realization of a preserved symmetry element if and only if
$p(\bigcap_{i=0}^{l} M_i) = \bigcap_{i=0}^{l} M_i$ and $p$ is the
identity action on the complement of $\bigcap_{i=0}^{l} M_i$ in
$E^n$. Hence, $p(M_i) = p(t_{0i}(M_0)) = t_{0i}(M_0)$ for all
$1 \le i \le l$. Therefore,
$t_{0i}^{-1}(p(t_{0i}(M_0))) = t_{0i}^{-1}(t_{0i}(M_0))$ and, from the
definition of the action of a group on a set,
$(t_{0i}^{-1} p\, t_{0i})(M_0) = M_0$. That is,
$t_{0i}^{-1} p\, t_{0i}$ stabilizes $M_0$. It needs to be shown that
$t_{0i}^{-1} p\, t_{0i}$ is the identity action on the complement of
$M_0$ in $E^n$. The fact that $p$ is the identity action on the
complement of $\bigcap_{i=0}^{l} M_i$ and that for $x \notin M_0$,
$t_{0i}(x) \notin M_i$, means that
$(t_{0i}^{-1} p\, t_{0i})(x) = t_{0i}^{-1}(t_{0i}(x)) = x$.

Suppose that $t_{0i}^{-1} p\, t_{0i}$ is a canonical realization for
some element of $S_{M_0}$ for all $1 \le i \le l$. Then
$(t_{0i}^{-1} p\, t_{0i})(M_0) = M_0$ and, by doing the reverse
algebra of the previous paragraph, $p(M_i) = M_i$ for all
$1 \le i \le l$. From the argument of lemma 1, this implies that
$p(\bigcap_{i=0}^{l} M_i) = \bigcap_{i=0}^{l} M_i$. Now suppose that
$p$ does not have the identity action on the complement of
$\bigcap_{i=0}^{l} M_i$. Since $p$ is already a canonical realization
for an element of $S_{M_0}$, it must at least have the identity
action on the complement of $M_0$. Therefore there must exist two
distinct points $x, \bar{x} \in M_0$,
$x, \bar{x} \notin \bigcap_{i=0}^{l} M_i$, such that
$p(x) = \bar{x}$. This means there exists a $1 \le k \le l$ such
that $x \notin M_k$. Hence $t_{0k}^{-1}(x) \notin M_0$ because the
range of $t_{0k}$ acting on $M_0$ is $M_k$. Since $t_{0k}$ is a
bijection, $t_{0k}^{-1}(x)$ and $t_{0k}^{-1}(\bar{x})$ must be
distinct. Since
$(t_{0k}^{-1} p\, t_{0k})\, t_{0k}^{-1}(x) = t_{0k}^{-1}(\bar{x})$,
$t_{0k}^{-1} p\, t_{0k}$ could not possibly be a canonical
realization of an element of $S_{M_0}$ since it is not the identity
action on $t_{0k}^{-1}(x) \notin M_0$. Contradiction. Hence $p$ must
be the identity action on the complement of $\bigcap_{i=0}^{l} M_i$
and therefore a canonical realization of a preserved symmetry
element.

Q.E.D.

The following theorem establishes a purely algebraic relationship
between the set of stereo automorphisms and the symmetry group
$S_{M_0}$ that characterizes when measurement disambiguation takes
place.

Theorem 2 For a given generalized stereo technique a measurement
solution locus $M_0$ is disambiguated by stereo automorphisms
$\{t_{01}, t_{02}, \ldots, t_{0l}\}$ corresponding to $l$ state
transitions of preselected imaging parameters if and only if there
exists a stereo automorphism $t_{0i}$, $1 \le i \le l$, such that
$t_{0i}$ is not in the normalizer subgroup of the canonical
realization of $S_{M_0}$ in $A_{E^n}$.

Proof: Lemma 2 shows that a solution locus $M_0$ is disambiguated
by stereo automorphisms $\{t_{01}, t_{02}, \ldots, t_{0l}\}$ if and
only if $S_{\bigcap_{i=0}^{l} M_i}$ is a proper subgroup of
$S_{M_0}$. Lemma 1 shows that the set of mutual symmetry elements,
$P$, with respect to $M_0$ and $M_i = t_{0i} M_0$, $1 \le i \le l$,
is isomorphic to $S_{\bigcap_{i=0}^{l} M_i}$. Combining these results
with the above commutative diagrams implies that a solution locus
$M_0$ is disambiguated by stereo automorphisms
$\{t_{01}, t_{02}, \ldots, t_{0l}\}$ if and only if the set of
canonical realizations for preserved symmetry elements is a proper
subgroup of the group of canonical realizations for the symmetry
group $S_{M_0}$ in $A_{E^n}$.

Suppose that $M_0$ is not disambiguated by stereo automorphisms
$\{t_{01}, t_{02}, \ldots, t_{0l}\}$, meaning the set of canonical
realizations for preserved symmetry elements is equal to the set of
canonical realizations of the group $S_{M_0}$ in $A_{E^n}$. From
theorem 1, this implies that each $t_{0i}$, $1 \le i \le l$, is in the
normalizer subgroup of the group of canonical realizations for
$S_{M_0}$ in $A_{E^n}$. Stating the contrapositive, if there exists a
stereo automorphism $t_{0i}$ not in the normalizer subgroup of the
group of canonical realizations for $S_{M_0}$ in $A_{E^n}$ then $M_0$
is disambiguated by stereo automorphisms
$\{t_{01}, t_{02}, \ldots, t_{0l}\}$.

The proof of the other direction is similar. Suppose that each
stereo automorphism $t_{0i}$, $1 \le i \le l$, is in the normalizer
subgroup of the canonical realizations of $S_{M_0}$ in $A_{E^n}$.
Then from theorem 1 each canonical realization corresponding to an
element in $S_{M_0}$ is a canonical realization for a preserved
element, implying that solution locus $M_0$ cannot be disambiguated.
The contrapositive of this statement proves the rest of the theorem.

Q.E.D. 
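
The normalizer criterion of Theorem 2 can be illustrated on a toy case. The sketch below is an assumption-laden analogue rather than the setting of the theorem: feature space is a set of four points, every permutation is admitted as an automorphism, and a single state transition ($l = 1$) is considered. The names `compose`, `inverse`, `normalizes` and `disambiguates` are hypothetical. For each permutation `t` whose image of the locus still intersects the locus, the code compares "the intersection is a proper subset of the locus" with "`t` does not normalize the group of canonical realizations".

```python
from itertools import permutations

def compose(p, q):
    """Composition p after q: x -> p[q[x]]."""
    return tuple(p[q[x]] for x in range(len(p)))

def inverse(p):
    """Inverse permutation of p."""
    inv = [0] * len(p)
    for x, y in enumerate(p):
        inv[y] = x
    return tuple(inv)

n = 4
M0 = frozenset({0, 1})

# Canonical realizations: permute M0 and act as the identity off M0.
G = {p for p in permutations(range(n))
     if all(p[x] == x for x in range(n) if x not in M0)}

def normalizes(t):
    """True if conjugation by t maps the group G onto itself."""
    t_inv = inverse(t)
    return {compose(t_inv, compose(p, t)) for p in G} == G

def disambiguates(t):
    """True if M0 intersected with t(M0) is a proper subset of M0."""
    M1 = frozenset(t[x] for x in M0)
    return (M0 & M1) < M0

results = []
for t in permutations(range(n)):
    M1 = frozenset(t[x] for x in M0)
    if M0 & M1:                      # the loci are required to intersect
        results.append(disambiguates(t) == (not normalizes(t)))

print(all(results), len(results))
```

Because every permutation is admitted here, the symmetry group of any subset is the full symmetric group on it; the check therefore only illustrates the algebraic form of the criterion, not the richer structure that arises when automorphisms are restricted to physically meaningful coordinate transformations.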

5  CONCLUSION 

A generalized axiomatic definition of stereo vision was presented
that was shown not only to unify existing stereo techniques but
to lead to a multitude of new stereo techniques not yet considered.
Several new examples of this generalized variety of stereo
vision were given. A given generalized stereo technique is
completely determined by (i) a visual object feature to be measured,
(ii) an image operation functional, (iii) a preselected set of
physically independent imaging parameters at least one of which is
to be varied between each successive image, and (iv) a system
of camera modeling equations. Previously only geometric imaging
parameters were varied in stereo techniques to measure object
features such as position or local surface orientation. Stereo
techniques that measured surface orientation were presented that
varied physical imaging parameters such as incident wavelength
and/or polarization, and it was suggested that surface orientation
could also be measured by varying object surface temperature or
external/internal magnetic and/or electric fields. The direct
measurement of surface curvature at a point on a surface, independent
of knowledge about neighboring surface normals, could also be
accomplished from the variation of any of these imaging parameters
using a 3-tuple image operation functional consisting of image
irradiance and the two components of the image gradient. This
by no means exhausts the possibilities for generalized stereo
techniques.

The central theme of measurement of object features by a
generalized stereo technique is disambiguation of measurement
solution loci in feature space. Theoretical developments were
presented to characterize the "size" of measurement ambiguity at a
point on a solution locus as well as the conditions under which
measurement disambiguation takes place. With a complete norm
defined on feature space, measurement solution loci were considered
as the union of embedded sub-manifolds, and one way of
characterizing the "size" of measurement ambiguity at a point is
the evaluation of the dimension of a sub-manifold neighborhood
containing that point, if in fact one exists. The implicit function
theorem was shown to provide an algorithm for evaluating
an upper bound on the dimension of measurement ambiguity at
a point.

In  section  4  the  formalism  for  generalized  stereo  was  recast  into 
the  language  of  group  theory  emulating  the  elegant  philosophy  of 
Klein's Erlangen program. Measurement ambiguity of a solution
locus  was  equivalently  presented  as  the  manifestation  of  symme¬ 
try  which  is  described  by  its  corresponding  symmetry  group  and 
represented  in  terms  of  feature  space  automorphisms  that  leave 
the  solution  locus  invariant.  Imaging  system  state  transitions  take 
the  form  of  feature  space  automorphisms  which  attempt  to  break 
up  symmetries  of  solution  loci  thus  reducing  measurement  ambi¬ 
guity.  When  expressed  in  this  formalism  very  precise  statements 
can  be  made  about  the  conditions  under  which  a  given  general¬ 
ized  stereo  technique  can  disambiguate  a  measurement  solution 
locus.  A  key  theorem  was  proven  using  the  canonical  realization 
of  the  symmetry  group  in  terms  of  feature  space  automorphisms 
for  a  solution  locus  stating  that  an  image  state  transition  dis¬ 
ambiguates  a  solution  locus  if  and  only  if  its  corresponding  stereo 
automorphism  is  not  in  the  normalizer  subgroup  of  the  realization 
of  the  symmetry  group  for  the  solution  locus. 

This  paper  is  merely  an  introduction  to  the  application  and 
theory  of  generalized  stereo  techniques.  Much  work  remains  to 
discover  and  implement  new  stereo  techniques  as  well  as  to  de¬ 
velop  a  rich  mathematical  formalism  that  embodies  these  tech¬ 
niques. 
Abstract

A  stochastic  optimization  approach  to  stereo  matching  is  pre¬ 
sented.  Unlike  conventional  correlation  matching  and  feature 
matching,  the  approach  provides  a  dense  array  of  disparities, 
eliminating  the  need  for  interpolation.  First,  the  stereo  match¬ 
ing  problem  is  defined  in  terms  of  finding  a  disparity  map  that 
satisfies  two  competing  constraints:  (1)  matched  points  should 
have  similar  image  intensity,  and  (2)  the  disparity  map  should 
vary  as  slowly  as  possible.  These  constraints  are  interpreted  as 
specifying  the  potential  energy  of  a  system  of  oscillators.  Ground 
states  are  approximated  by  a  new  variant  of  simulated  anneal¬ 
ing,  which  has  two  important  features.  First,  the  microcanonical 
ensemble  is  simulated  using  a  new  algorithm  that  is  more  effi¬ 
cient  and  more  easily  implemented  than  the  familiar  Metropolis 
algorithm  (which  simulates  the  canonical  ensemble).  Secondly, 
it  uses  a  hierarchical,  coarse-to-fine  control  structure  employing 
laplacian  pyramids  of  the  stereo  images.  In  this  way,  quickly- 
computed  results  at  low  resolutions  are  used  to  initialize  the 
system  at  higher  resolutions. 

1  Introduction 

Few  problems  in  computational  vision  have  been  investigated 
more  vigorously  than  stereo.  Compared  to  other  modes  of  depth 
perception,  stereo  vision  seems  relatively  straightforward.  The 
images received by two eyes are slightly different due to binocular
parallax: that is, they exhibit a disparity that varies over the
visual field, and that is inversely related to the distance of
imaged points from the observer. If we can determine this disparity
field we can measure depth and mimic human stereo vision.

This paper describes an approach to stereo in which the
matching problem is posed as a computational analogy to a
thermodynamic physical system. The state of the system encodes
a disparity map that specifies the correspondence between the
images. Each such state has an energy that provides a heuristic
measure of the "quality" of the correspondence. To solve the
stereo matching problem, one looks for the ground state: that
is, the state (or states) of lowest potential energy.

Support  for  this  work  was  provided  by  the  Defense  Advanced  Research 
Projects  Agency  under  contracts  DCA76- f'-libut  and  MDA903  8G-< 
0084. 


The  remainder  of  this  section  briefly  discusses  the  major  ap¬ 
proaches  to  stereo  matching.  In  Section  2  the  model  system 
for  the  stochastic  optimization  method  is  defined.  Section  3  de¬ 
scribes  how  a  stochastic  technique  called  simulated  annealing 
can  be  used  to  perform  the  optimization  and  introduces  a  new, 
more  efficient  variety  of  simulated  annealing,  called  microcanon¬ 
ical,  hierarchical  annealing.  This  method  samples  a  microcanon¬ 
ical  ensemble  over  a  sequence  of  increasingly  finer  scales.  Several 
experimental  results  are  given  in  Section  4.  Section  5  concludes 
with  some  observations. 

1.1  Background 

The  development  of  computational  models  of  stereo  vision  has 
been  guided  by  both  scientific  and  technological  motivations. 
The  modularity  of  stereopsis  in  the  human  visual  system,  con¬ 
clusively  demonstrated  by  random-dot  stereograms,  indicates 
that  this  perceptual  function  can  be  studied  in  isolation.  If  the 
same  computational  principles  used  for  stereo  also  apply  to  other 
modes  of  perception  a  successful  model  of  stereo  could  suggest 
models  for  other  vision  problems.  Stereo  finds  important  prac¬ 
tical  applications  in  mapping  and  robot  sensing. 

1.2  Correlation 

Perhaps the most obvious approach to stereo matching, loosely
called "correlation," is to choose intensity patches in one image
and then to search for the best matching location in the other
image, typically using normalized cross-correlation as a measure
of similarity or mean-square-difference as a measure of
dissimilarity. Many variations of this basic theme have been explored.

This general approach suffers from some difficult problems.

1.  The size of the patches affects the likelihood of false
matches. A patch must be large enough to contain the information
necessary to specify another patch unambiguously;
or, failing this, some additional means of disambiguating
false matches must be used.

2.  At the same time, the patches must be small compared to
the variation in the disparity map. If the patches are too
large the system will be insensitive to significant relief in
the scene. These problems have motivated the use of scale
hierarchies. (See 1.4 below.)

3.  In typical images much of the area consists of uniform or
slowly varying intensity, and correlation will not be sensitive
in such cases. In practice, a correlation method can provide
only a relatively sparse set of correspondences, from which
a dense map must then be interpolated.
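
The correlation approach and its third difficulty can be sketched on a single scanline. The code below is an illustrative assumption, not a method described in this paper: it works in one dimension, uses integer shifts with disparity taken as $d = x' - x$, and returns a score of zero for constant patches, a convention chosen here so that uniform regions give no preference. The function names `ncc` and `best_disparity` are hypothetical.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length sequences.
    Returns 0.0 when either patch is constant (assumed convention)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom

def best_disparity(left, right, x, half, max_d):
    """Search shifts d in [0, max_d] for the right-image patch that best
    matches the left-image patch of half-width `half` centred at x."""
    patch = left[x - half: x + half + 1]
    best_d, best_score = 0, -2.0
    for d in range(max_d + 1):
        lo, hi = x + d - half, x + d + half + 1
        if lo < 0 or hi > len(right):
            continue                      # candidate patch leaves the image
        score = ncc(patch, right[lo:hi])
        if score > best_score:
            best_d, best_score = d, score
    return best_d, best_score

left = [0, 0, 1, 5, 2, 0, 0, 0, 0, 0]
right = [0, 0, 0, 0, 1, 5, 2, 0, 0, 0]   # the same pattern shifted by 2
d, score = best_disparity(left, right, x=3, half=1, max_d=4)
flat = ncc([0, 0, 0], [0, 0, 0])          # uniform patch: no information
print(d, round(score, 6), flat)
```

A textured patch is located unambiguously, while the uniform patch yields a score that cannot distinguish any shift from any other, which is why such methods leave gaps that must be interpolated.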

1.3  Feature  Matching 

Another approach is to attempt matching only on “information-rich”
points. Even in correlation methods an interest operator
is  often  used  to  screen  patches.  The  feature-matching  approach 
seeks  to  establish  correspondences  directly  between  discrete  sets 
of  points  —  typically,  the  output  of  an  edge  detector,  such  as 
zero-crossing  contours. 

This  approach  suffers  from  similar  difficulties: 

1.  The  support  of  the  feature  detector  affects  the  likelihood 
of  false  matches.  Zero-crossings  from  high-frequency  bands 
will  probably  have  many  ambiguous  matches  for  significant 
ranges  of  disparity. 

2.  The support of the feature detector must be small compared
to the variation in the disparity map (caused by relief
in the scene) if the 2D features are to locate 3D features
accurately.

3.  Feature  matching  provides  sparse  matches  by  definition. 

1.4  Scale  Hierarchy 

Disparity scales linearly with image resolution. This suggests that
a stereo matcher can begin its search at a coarse scale, find
coarsely-quantized disparities, use this result to initialize its
search at a finer scale, and so on. In addition to improving
efficiency by limiting the effective search space of the matcher,
this technique ameliorates the false target problem. Coarse-to-fine
control strategies have been used in both correlation and
feature-matching models.
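
The coarse-to-fine idea can be sketched for the simplest possible case, a single global shift between two scanlines. Everything in this sketch is an assumption made for illustration: resolution is halved by averaging adjacent pairs (a crude stand-in for a proper pyramid), the match score is a mean squared difference, and the names `downsample`, `ssd_at_shift` and `coarse_to_fine_shift` are hypothetical. The coarse estimate is doubled at each finer level and only its immediate neighbours are searched.

```python
def downsample(sig):
    """Halve the resolution by averaging adjacent pairs of samples."""
    return [(sig[i] + sig[i + 1]) / 2.0 for i in range(0, len(sig) - 1, 2)]

def ssd_at_shift(left, right, d):
    """Mean squared difference between left[x] and right[x + d]
    over the samples where both indices are valid."""
    pairs = [(left[x], right[x + d])
             for x in range(len(left)) if 0 <= x + d < len(right)]
    if not pairs:
        return float("inf")
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

def coarse_to_fine_shift(left, right, levels, coarse_range):
    """Estimate a global shift: full search at the coarsest level,
    then a +/-1 search around twice the estimate at each finer level."""
    pyramid = [(left, right)]
    for _ in range(levels):
        l, r = pyramid[-1]
        pyramid.append((downsample(l), downsample(r)))
    l, r = pyramid[-1]
    d = min(range(-coarse_range, coarse_range + 1),
            key=lambda s: ssd_at_shift(l, r, s))
    for l, r in reversed(pyramid[:-1]):
        centre = 2 * d                    # disparity doubles with resolution
        d = min(range(centre - 1, centre + 2),
                key=lambda s: ssd_at_shift(l, r, s))
    return d

left = [0.0] * 32
left[8:12] = [1.0, 3.0, 3.0, 1.0]
right = [0.0] * 32
right[12:16] = [1.0, 3.0, 3.0, 1.0]      # the same bump shifted by 4
shift = coarse_to_fine_shift(left, right, levels=2, coarse_range=3)
print(shift)
```

With two pyramid levels the search covers shifts up to roughly twelve samples at full resolution while evaluating only seven candidates at the coarsest level and three at each finer level.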

1.5  Lattice  and  Variational  Models 

Several models of stereo vision fit neither the correlation nor the
feature-matching paradigms; instead, they pose the matching
problem in terms of optimizing a global measure [1,2,3]. To take
one example, Julesz proposed a model consisting of two lattices
of spring-loaded magnetic dipoles, representing the two images
of a random-dot stereogram [2]. The polarity of the dipoles
represents whether pixels in the left and right images are black
or white. A state of global fusion is achieved in the ground state,
with the attraction or repulsion of the dipoles balanced by the
forces of the springs.

More recently, Poggio et al. have proposed a regularization
criterion based on minimizing the following quantity [4]:

$$\mathcal{E} = \iint \left\{ \left[ \nabla^2 G \ast \left( I_L(x,y) - I_R(x + D(x,y),\, y) \right) \right]^2 + \lambda (\nabla D)^2 \right\} dx\, dy, \qquad (1)$$

where $I_L$ and $I_R$ are continuous intensity functions in the left
and right visual fields, $\nabla^2 G$ is a linear bandpass filter
(Laplacian of a Gaussian), $\nabla D$ is the gradient of disparity,
and $\lambda$ is a constant. Equation (1) can be justified in terms of
two heuristics: the first term in the integrand is a measure of
photometric difference; the second is a measure of the first-order
variation in the disparity map. In this heuristic sense it is similar
to the Julesz spring-dipole model, with the two terms corresponding to
the potential energy of the dipoles and the springs, respectively. Of
course, (1) has the advantage of being precise, as well as addressing
the case of continuous intensity.
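
A discrete analogue of criterion (1) may help fix ideas. The sketch below rests on several assumptions that are not taken from the text: the two images are taken to be already band-pass filtered, disparities are integers measured from the left image, the gradient is approximated by forward differences, and samples falling outside the right image are clamped to the border. The function name `energy` and the parameter name `lam` are hypothetical.

```python
def energy(left, right, disp, lam):
    """Sum over pixels of the squared photometric difference plus
    lam times the squared forward-difference gradient of disparity."""
    rows, cols = len(left), len(left[0])
    total = 0.0
    for y in range(rows):
        for x in range(cols):
            # Matching column in the right image, clamped at the border.
            xr = min(max(x + disp[y][x], 0), cols - 1)
            photometric = (left[y][x] - right[y][xr]) ** 2
            dx = disp[y][x + 1] - disp[y][x] if x + 1 < cols else 0
            dy = disp[y + 1][x] - disp[y][x] if y + 1 < rows else 0
            total += photometric + lam * (dx * dx + dy * dy)
    return total

left = [[0, 0, 9, 0, 0, 0],
        [0, 0, 9, 0, 0, 0]]
right = [[0, 0, 0, 9, 0, 0],
         [0, 0, 0, 9, 0, 0]]              # left image shifted by one pixel

zeros = [[0] * 6 for _ in range(2)]
ones = [[1] * 6 for _ in range(2)]
spike = [[0, 0, 1, 0, 0, 0] for _ in range(2)]

e_zero = energy(left, right, zeros, lam=10)   # wrong, but smooth
e_true = energy(left, right, ones, lam=10)    # correct and smooth
e_spike = energy(left, right, spike, lam=10)  # partly right, not smooth
print(e_zero, e_true, e_spike)
```

The correct uniform disparity map attains the lowest energy, while the map that fixes only one pixel pays a smoothness penalty in addition to the remaining photometric error.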

Witkin et al. described a method for optimizing (1) that is
essentially a sophisticated form of gradient descent which tracks
the solution over increasingly finer scales [5]. The hope is that
$\mathcal{E}$ is convex at a coarse scale and that relatively coarse
intermediate solutions will place the system in the correct convex
region at finer scales. They report that the method is prone
to error when it encounters bifurcations in its trajectory. As
the scale becomes finer the system must "choose" which path
to follow, and it cannot recover from a mistake because
$\mathcal{E}$ may never increase. The solution is therefore
critically dependent on initial conditions. This paper presents an
alternative stochastic method that can cope with this problem.


2  The  Model  System 


2.1  Epipolar  Camera  Model 


We assume that two coplanar images $\mathcal{L}(x,y)$ and
$\mathcal{R}(x,y)$ are formed by central projection with focal
length $f$, and with the centers of projection separated by distance
$B$ along a baseline parallel to the image planes. For convenience,
we assume that $0 \le x, y \le 1$. If a point $(x, y, f)$ in the left
image matches point $(x', y', f)$ in the right image (that is, it has
disparity $d = x' - x$), the 3D coordinates of the imaged point with
respect to the left camera are


P 
Under these conditions, the disparities are restricted to the
horizontal ($x$) direction. This assumption involves no loss of
generality, because if the relative positions and orientations of
the two cameras are known, as well as the internal camera
parameters, correspondences are restricted to epipolar lines. If
the epipolar lines are not horizontal the images can be mapped
into a “normal” stereo pair in which they are.


2.2  Cyclopean  Disparity  Map 

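At this point we identify $\mathcal{L}$ and $\mathcal{R}$ with $\nabla^2 G \circ I_L(x,y)$ and
$\nabla^2 G \circ I_R(x,y)$ in (1). We seek a disparity map, $\mathcal{D}(x,y)$, defined over
the same interval as $\mathcal{L}$ and $\mathcal{R}$, which specifies the correspondence
between $\mathcal{L}$ and $\mathcal{R}$. The cyclopean representation introduced by
Horn [6] defines $\mathcal{D}$ with the following relation:

$$\mathcal{L}\left(x - \frac{\mathcal{D}(x,y)}{2},\, y\right) \ \text{corresponds to}\ \mathcal{R}\left(x + \frac{\mathcal{D}(x,y)}{2},\, y\right).$$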

The major advantage of the cyclopean representation is that, by
defining disparity without preference to either image, it allows
a  more  uniform  treatment  of  occlusion  boundaries. 

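Rewriting the integrand of equation (1) in the cyclopean form
we have:

$$\mathcal{E}(x,y) = \left[\mathcal{L}\left(x - \frac{\mathcal{D}(x,y)}{2},\, y\right) - \mathcal{R}\left(x + \frac{\mathcal{D}(x,y)}{2},\, y\right)\right]^2 + \lambda\,[\nabla\mathcal{D}(x,y)]^2 \qquad (2)$$

$$\phantom{\mathcal{E}(x,y)} = \mathcal{E}_1(x,y) + \lambda\,\mathcal{E}_2(x,y), \qquad (3)$$

and the quantity to be minimized is

$$\mathcal{E}(\mathcal{D}) = \iint \mathcal{E}(x,y)\,dx\,dy. \qquad (4)$$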
Figure 1: Spring model.


2.3  A  Spring  Model 

To establish a concrete idea of the meaning of (3) and (4),
consider the spring model illustrated in one dimension in Figure 1.

The model consists of two surfaces, $\mathcal{R}(x,y)$ below and
$\mathcal{L}(x,y) + S_1$ above. Midway between these surfaces is a
lattice of pivot points, and at each such point is an elastic lever
arm, with rest length $S_1$ and spring constant $k_1$. The lever arms
are free to rotate in the $(x,z)$ plane (i.e., in epipolar planes),
and their endpoints are constrained to lie on the two surfaces.
The lever arms are connected to their neighbors by other springs
with spring constant $k_2$ which exert a torque over moment arm
$A$. The angles of the lever arms represent disparity on an $M^2$
cyclopean lattice:

$$D(i,j) = \mathcal{D}(x_i, y_j), \qquad 0 \le i,j < M$$

$$D(i,j) \approx S_1 \sin\theta_{i,j}.$$

The potential energy stored in a lever arm is approximately¹

«■(..;)*  UtlCi.r- 

and in a connecting spring is approximately

1  \‘ 


'Hiii.jA'Jl  ^  —  j  ]D{iJ)  -  D(k\l)Y  . 

The energy associated with a single lattice point is

4  -  ^  .  (*n 
where $\mathcal{N}_{i,j}$ is the set of neighbors of $(i,j)$. The energy of the
entire system is

H(D)  = 

Comparing terms between (3) and (5), we have approximately

$$H(D) \approx \mathcal{E}(\mathcal{D})$$


with 


We can therefore interpret $\mathcal{E}$ as a Hamiltonian specifying the
energy of the spring system, neglecting kinetic energy terms.
The constant $\lambda$ is proportional to the relative stiffness of the
two types of springs.

A physical realization of the spring model would be a dynamic
system of oscillators that would follow a trajectory through a
$2M^2$-dimensional phase space. (Each lever arm has two degrees
of freedom: $\theta$ and $\dot\theta$.) We could flesh out this model by
specifying the moments of inertia and damping coefficients of the
lever arms. We could also add a periodic forcing function to
add energy to the system, balancing the energy dissipated by
damping. Having done this, we could write a Hamiltonian,
containing both potential terms depending on $\theta$ and kinetic
terms depending on $\dot\theta$, describing the model's deterministic
dynamic behavior. In principle, we could trace the trajectory of
the system through phase space, gradually reducing the amplitude
of the forcing function while keeping the system in dynamic
equilibrium. There is little point in simulating the dynamics in
such detail, however, since we know that even low-dimensional
forced oscillators have chaotic attractors [7]. The dynamics will
be effectively stochastic.

An alternative and much less expensive approach, which we
shall take in the next two sections, is to explicitly acknowledge
the stochastic, ergodic nature of the model. Kinetic energy will
be modeled as heat.




3  Stochastic  Optimization 


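We have already partially discretized (1) by defining the lattice
$D$ on $\mathcal{D}$. At this point we similarly define lattices $L$ and $R$ on
$\mathcal{L}$ and $\mathcal{R}$. $D$ now has integer values and is interpreted as:

$$L\left(i - \frac{D(i,j)}{2},\, j\right) \ \text{corresponds to}\ R\left(i + \frac{D(i,j)}{2},\, j\right).$$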


Equation  (3)  becomes 


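$$E(i,j) = \left[L\left(i - \frac{D(i,j)}{2},\, j\right) - R\left(i + \frac{D(i,j)}{2},\, j\right)\right]^2 + \lambda\,[\nabla D(i,j)]^2 \qquad (6)$$

$$[\nabla D(i,j)]^2 = \sum_{k,l \in \mathcal{N}_{i,j}} [D(i,j) - D(k,l)]^2$$

The total potential energy is

$$E = \sum_{i,j} E(i,j)$$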


In terms of the spring model, the ends of the lever arms are
now constrained to lie on a finite number of positions on the
two surfaces. Although this system is finite, it is also
high-dimensional, and the problem of finding minimal-energy states
is still difficult because the number of possible states is vast
(exponential in $M^2$). Furthermore, there is no reason to believe
that $E$ is convex, and therefore no reason to believe that a
discrete iterative improvement algorithm would work.
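As an illustration, the discrete energy of equation (6) can be evaluated directly. The Python sketch below is not the implementation used in the experiments. It assumes NumPy arrays indexed as [i, j] with i the horizontal coordinate, integer disparities whose half-offsets are truncated, samples clamped at the image border, and four nearest neighbors. The names `local_energy` and `total_energy` are hypothetical.

```python
import numpy as np

def local_energy(L, R, D, i, j, lam=50.0):
    """Energy E(i, j) of one lattice site, in the spirit of equation (6)."""
    n_i, n_j = D.shape
    d = int(D[i, j])
    half = d // 2
    # Cyclopean sampling: the left and right samples are offset in opposite
    # directions so that their separation equals the disparity d (assumption:
    # integer truncation of d/2, clamping at the border).
    il = min(max(i - half, 0), n_i - 1)
    ir = min(max(i + (d - half), 0), n_i - 1)
    data = (float(L[il, j]) - float(R[ir, j])) ** 2
    # Smoothness term: squared differences to the four nearest neighbors.
    smooth = 0.0
    for k, l in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
        if 0 <= k < n_i and 0 <= l < n_j:
            smooth += (float(D[i, j]) - float(D[k, l])) ** 2
    return data + lam * smooth

def total_energy(L, R, D, lam=50.0):
    """Total potential energy E: the sum of E(i, j) over the lattice."""
    return sum(local_energy(L, R, D, i, j, lam)
               for i in range(D.shape[0]) for j in range(D.shape[1]))
```

Because E is the sum of the site energies, each neighboring pair contributes to the smoothness term twice, once from each of its two sites.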


3.1  Standard  (Canonical)  Annealing 


Simulated annealing is a fairly new technique for solving such
combinatorial optimization problems. In Section 3.3 a new
variety of simulated annealing (called microcanonical annealing) is
presented which has several advantages for computer
implementation. In this section the basic principles of the standard
form of simulated annealing are described to set a context for the
introduction of microcanonical annealing. An excellent review of
simulated annealing may be found in Laarhoven and Aarts [9].

The most fundamental result of statistical physics is the
Boltzmann (or Gibbs) distribution:

$$\Pr(E_\nu) = \frac{\exp(-E_\nu/kT)}{Z(T)}$$

which gives the probability of finding a system in state $\nu$ with
energy $E_\nu$, assuming that the system is in equilibrium with a large
heat bath at temperature $kT$ ($k$ is Boltzmann's constant). The
normalizing quantity in the denominator, called the partition
function, is a sum over all accessible states $\nu$:

$$Z(T) = \sum_\nu \exp(-E_\nu/kT)$$


Physicists are generally interested in calculating macroscopic
properties of model systems at various temperatures. The
average value of some macroscopic variable $A$ (which may be the
average energy of the system, for example) can be written:

$$\langle A \rangle = \frac{\sum_\nu A_\nu \exp(-E_\nu/kT)}{Z(T)}$$

Unfortunately, the partition function is usually impossible to
calculate.


1. Begin with the system in an arbitrary state $\nu$.

2. Make a small change to the state, typically by changing the
system in only one degree of freedom. Call the new state $\nu'$.

3. Evaluate the resulting change in energy: $\Delta E = E_{\nu'} - E_\nu$.

4. If $\Delta E < 0$ (that is, the change takes the system to a state
of lower energy) accept the change.

5. If $\Delta E > 0$ accept the change with probability
$\exp(-\Delta E/kT)$.

6. Repeat steps (2) through (5) until the system reaches
equilibrium.


Figure 2: The Metropolis algorithm.


In 1953 Metropolis et al. [8] described a Monte Carlo
algorithm that generates a sequence of states which converges to the
Boltzmann distribution in the limit (Figure 2). This method,
which simulates the effect of allowing the system to interact
with a much larger heat bath, samples what is called the
canonical ensemble. Macroscopic parameters can then be calculated
without knowledge of the partition function by averaging over
long sequences, weighting each state by its Boltzmann factor
$\exp(-E_\nu/kT)$. The Metropolis algorithm begins in a random
state and then successively generates small, random state
transitions ($\nu \to \nu'$) with the following probability:

$$\Pr(\nu \to \nu') = \begin{cases} 1 & \text{if } \Delta E < 0 \\ \exp(-\Delta E/kT) & \text{otherwise} \end{cases}$$

where $\Delta E = E_{\nu'} - E_\nu$. Asymptotic convergence of the
Metropolis algorithm to the Boltzmann distribution is guaranteed if the
process for generating candidate state transitions is ergodic.
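The acceptance rule above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `metropolis_accept` and the handling of $kT = 0$ (uphill moves are always rejected) are assumptions, not part of the original formulation.

```python
import math
import random

def metropolis_accept(delta_e, kT, rng=random):
    """Decide whether to accept a transition with energy change delta_e."""
    # Downhill (or neutral) moves are always accepted; exp(0) = 1, so
    # treating delta_e == 0 as accepted agrees with the stated probability.
    if delta_e <= 0:
        return True
    # Assumption: at zero temperature no uphill move is ever accepted.
    if kT <= 0:
        return False
    # Uphill moves are accepted with probability exp(-delta_e / kT).
    return rng.random() < math.exp(-delta_e / kT)
```

Note that each uphill decision requires one evaluation of the exponential and one uniform random number.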

Kirkpatrick [10] and Cerny [11] independently recognized a
profound connection between the Metropolis technique and
combinatorial optimization problems. If the energy of a state is
considered as an objective function to be minimized, the minimum
can be approximated by generating sequences at decreasing
temperatures, until finally a ground state, or a state with energy
very close to a ground state, is reached at $kT = 0$. This is
analogous to the physical process of annealing.
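A toy example may make the procedure concrete. The sketch below anneals a single integer variable with Metropolis acceptance while the temperature is lowered. The geometric cooling schedule, the parameter values, the function name `anneal`, and the bookkeeping of the best state seen are all assumptions chosen for illustration; they are not the schedule used in this paper.

```python
import math
import random

def anneal(energy, start, lo, hi, kT_start=10.0, kT_end=0.01,
           steps=5000, seed=0):
    """Minimize energy(x) over the integers lo..hi by simulated annealing."""
    rng = random.Random(seed)
    state = best = start
    # Assumed geometric schedule taking kT from kT_start to kT_end.
    cooling = (kT_end / kT_start) ** (1.0 / steps)
    kT = kT_start
    for _ in range(steps):
        candidate = state + rng.choice((-1, 1))   # small random change
        if lo <= candidate <= hi:
            delta_e = energy(candidate) - energy(state)
            # Metropolis rule at the current temperature.
            if delta_e <= 0 or rng.random() < math.exp(-delta_e / kT):
                state = candidate
                if energy(state) < energy(best):
                    best = state
        kT *= cooling
    return state, best
```

For example, `anneal(lambda x: (x - 7) ** 2, 0, 0, 19)` starts at 0 and wanders freely while the temperature is high, then settles toward the minimum at 7 as the temperature falls.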

There are results showing the existence of annealing schedules
(i.e., the rate of decrease of temperature) that guarantee
convergence to ground states in finite time [12], but these schedules
are too slow for practical use. Faster ad hoc schedules have been
used in many problems with good average-case performance.
While faster schedules may not find an optimal state, they can
converge to states that are very close to optimal. The application
of standard simulated annealing to the stereo matching problem
is straightforward. An early version is described in [13].
(Marroquin [14] and Divko [15] have independently described similar
methods.) In the following sections a more efficient version is
presented.


3.2  Annealing  over  Scale 

Simulated  annealing  could  be  applied  directly  to  a  pair  of  stereo 
images  at  the  finest  scale,  but  the  convergence  would  be  rather 
slow  if  the  images  had  a  large  range  of  disparity.  This  was  the 
approach  reported  in  [13]  (using  the  standard  annealing  algo¬ 
rithm  and  a  slightly  different  energy  function). 

A  more  efficient  method  is  to  use  the  coarse-to-fine  strategy 
that  has  been  found  to  be  so  effective  in  other  image-matching 
work.  At  a  coarse  level  of  resolution  the  number  of  lattice  sites 
and  the  range  of  disparity  are  small;  therefore,  the  size  of  the 
state  space  is  relatively  small.  We  should  be  able  to  compute  an 
approximate  ground  state  quickly,  and  then  use  it  to  initialize 
the  annealing  process  at  the  next,  finer  level  of  resolution,  and 
so  on. 

The laplacian pyramid, originally developed as a compact
image-coding technique [16], offers an efficient representation for
hierarchical annealing. In a laplacian pyramid an $N \times N$ image
(for convenience, assume $N = 2^n$) is transformed into
a sequence of bandpass-filtered copies, $\{I_k,\ k = 0, \ldots, n\}$, where
$I_k$ is an image of size $2^{n-k} \times 2^{n-k}$. Each image is therefore
smaller than its predecessor by a factor of 1/2 in linear
dimension and a factor of 1/4 in area. We will refer to $I_k$ as the image
at level $k$. The center frequency of the passband is reduced
by one octave between levels. This transform can be computed
efficiently by recursively applying a small generating kernel to
create a gaussian (low-passed) pyramid, and then differencing
successive low-passed images to construct the laplacian
pyramid. The difference-of-gaussians gives a good approximation to
the $\nabla^2 G$ filter.
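The construction just described can be sketched as follows. This is a minimal illustration, not the code used for the experiments. The 5-tap binomial generating kernel, edge-replicating padding, and expansion by pixel replication followed by smoothing are assumptions; the function names are hypothetical.

```python
import numpy as np

# Assumed 5-tap generating kernel (binomial weights, sums to 1).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def blur(img):
    """Separable low-pass filter with edge-replicating padding."""
    h, w = img.shape
    padded = np.pad(img, 2, mode="edge")
    tmp = sum(KERNEL[k] * padded[:, k:k + w] for k in range(5))   # along columns
    return sum(KERNEL[k] * tmp[k:k + h, :] for k in range(5))     # along rows

def reduce_level(img):
    """Low-pass filter, then keep every other sample in each dimension."""
    return blur(img)[::2, ::2]

def expand_level(img):
    """Double the size by pixel replication, then smooth."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return blur(up)

def laplacian_pyramid(img, levels):
    """Return (laplacian levels, gaussian levels) for a square image."""
    gauss = [np.asarray(img, dtype=float)]
    for _ in range(levels):
        gauss.append(reduce_level(gauss[-1]))
    # Each bandpass level is the difference of successive low-passed images.
    lap = [gauss[k] - expand_level(gauss[k + 1]) for k in range(levels)]
    return lap, gauss
```

A constant image has no bandpass content, so every laplacian level of such an image is zero, which is a convenient sanity check.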

After  constructing  laplacian  pyramids  from  the  original  stereo 
images,  disparity  is  reduced  by  a  factor  of  1/2  in  successive 
scales.  Therefore,  at  some  level,  disparity  is  small  everywhere. 
For  typical  stereo  images,  we  can  take  this  to  be  level  n  -  3. 
(For  example,  if  the  original  images  were  a  power  of  2  in  linear 
dimension,  the  laplacian  images  at  level  n  -  3  would  be  8  x  8 
pixels.  Disparities  in  the  range  of  0  to  63  pixels  in  a  pair  of 
512  x  512  images  would  be  reduced  to  0,  with  truncation.)  We 
shall  start  annealing  at  this  level,  find  an  approximate  ground 
state,  and  then  expand  the  solution  to  the  next  scale.  To  make 
this  coarse-to-fine  strategy  work,  however,  we  must  specify  how 
a  low-resolution  result  is  used  to  start  the  annealing  process  at 
the  next-higher  scale. 

Expanding  a  low-resolution  result  to  the  next  level  presents  a 
problem.  Obviously,  one  should  begin  by  simply  doubling  the 
size  of  the  low-resolution  lattice  and  doubling  the  disparity  val¬ 
ues.  Having  done  this,  however,  the  new  state  has  a  low  energy 
but the system is not close to equilibrium. Every odd
disparity value is "unoccupied," and the new map is therefore more
uniform than it should be. This spurious uniformity, which is
solely  due  to  the  quantization  of  the  previous  result,  is  likely  to 
place  the  system  near  a  local  minimum  from  which  it  will  not 
recover.  Fortunately,  there  is  an  easy  solution  to  this  problem; 
destroy  this  uniformity  by  adding  heat.  Simply  run  the  anneal¬ 
ing  algorithm  “in  reverse”  by  adding  energy  instead  of  removing 
it.  Heating  may  proceed  much  faster  than  cooling  because  the 
system  relaxes  to  equilibrium  quickly  at  high  temperatures. 
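The expansion step, and the quantization artifact it creates, can be seen in a short sketch. It assumes a NumPy integer disparity lattice and uses pixel replication to double the lattice size; the name `expand_disparity` is hypothetical. The heating pass that follows expansion is not shown.

```python
import numpy as np

def expand_disparity(D):
    """Double the lattice size and double every disparity value."""
    bigger = np.repeat(np.repeat(D, 2, axis=0), 2, axis=1)
    return 2 * bigger

coarse = np.array([[0, 1],
                   [2, 3]])
fine = expand_disparity(coarse)
# Every value in `fine` is even: the odd disparity levels are "unoccupied"
# until heating redistributes the state.
```

The uniformity is visible directly: each coarse site becomes a 2 x 2 block of identical, even disparities.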

3.3  Microcanonical  Annealing 

Creutz has described an interesting alternative to the Metropolis
algorithm [17], outlined in Figure 3. Instead of simulating
the effect of a large heat bath, the Creutz algorithm simulates a
thermally isolated system in which energy is conserved. Samples
are drawn from the microcanonical ensemble. One can imagine
the difference between the Metropolis algorithm and the
Creutz algorithm as follows. The Metropolis algorithm generates
a "cloud" of states, each with, in general, different energies,
which fills a volume of phase space. As temperature decreases
this volume contracts to one or more ground states. The Creutz
algorithm, by contrast, generates states on a constant-energy
surface in a somewhat larger phase space. As energy decreases
these surfaces shrink to the same set of ground states.


1. Begin with the system in an arbitrary state $\nu$.

2. Make a small change to the state, typically by changing the
system in only one degree of freedom. Call the new state $\nu'$.

3. Evaluate the resulting change in energy: $\Delta E = E_{\nu'} - E_\nu$.

4. If $\Delta E < 0$ accept the change and increase the demon energy
($E_D \leftarrow E_D - \Delta E$).

5. If $\Delta E > 0$ accept the change contingent upon $E_D$:

•  If $\Delta E < E_D$ accept the change and decrease the demon
energy ($E_D \leftarrow E_D - \Delta E$).

•  Otherwise, reject the change.

6. Repeat steps (2) through (5) until the system reaches
equilibrium.


Figure 3: The Creutz algorithm.

The simplest way to accomplish this is to augment the system
with one additional degree of freedom, called a demon, which
carries a variable amount of energy, $E_D$. This demon holds the
kinetic energy of the system and, in effect, replaces the heat
bath. The total energy of the system is now

$$E_{total} = E_{potential} + E_{kinetic} = E + E_D$$

The demon energy, being kinetic, is constrained to be
nonnegative. The algorithm accepts all transitions to lower energy states,
adding $-\Delta E$ (the energy given up) to $E_D$. Transitions to higher
energy are accepted only when $\Delta E < E_D$, and the energy gained
is taken away from $E_D$. Total energy remains constant.

Microcanonical annealing simply replaces the Metropolis
algorithm with the Creutz algorithm. Instead of explicitly reducing
temperature, the microcanonical annealing algorithm reduces
energy by gradually lowering the value of $E_D$. Standard
arguments can be used to show that at equilibrium $E_D$ assumes a
Boltzmann distribution over time [17]:

$$\Pr(E_D = E) \propto \exp(-E/kT).$$

Temperature therefore emerges as a statistical feature of the
system:

$$kT = \langle E_D \rangle. \qquad (9)$$
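The single-demon procedure can be illustrated on a toy system. In the sketch below the "system" is a chain of integer heights with energy equal to the sum of squared differences between adjacent sites; this toy energy, the function names, and the choice to accept a move when $\Delta E$ equals the demon energy exactly (so that the demon can be drained to zero) are assumptions made for illustration.

```python
import random

def chain_energy(h):
    """Toy potential energy: squared differences between adjacent sites."""
    return sum((a - b) ** 2 for a, b in zip(h, h[1:]))

def creutz_step(delta_e, demon):
    """Return (accepted, new_demon_energy) for a proposed transition."""
    # Downhill moves always pass this test because the demon energy is
    # nonnegative; uphill moves pass only if the demon can pay for them.
    if delta_e <= demon:
        return True, demon - delta_e
    return False, demon

def creutz_run(h, demon, sweeps, rng):
    """Run the demon over the chain; return final state, demon, history."""
    h = list(h)
    history = []
    for _ in range(sweeps * len(h)):
        i = rng.randrange(len(h))
        step = rng.choice((-1, 1))
        before = chain_energy(h)
        h[i] += step
        delta_e = chain_energy(h) - before
        accepted, demon = creutz_step(delta_e, demon)
        if not accepted:
            h[i] -= step            # undo the rejected change
        history.append(demon)
    return h, demon, history
```

The sum of the chain energy and the demon energy never changes, and the time average of `history` plays the role of $kT$ in (9). No exponential and no floating-point comparison is needed.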

This simple version of microcanonical annealing, using only
one demon, is not well suited to parallel implementation. Each
decision to accept or reject a state transition depends on the value
of $E_D$, and therefore on the previous decision. The computation
can be made parallel by using a lattice of demons. Temperature
is still measured with (9), but using the distribution of $E_D$ over
space rather than time.²

There is a minor complication in using a lattice of demons.
The single-demon algorithm visits sites at random and the
demon allows energy to be transferred throughout the lattice.
Similarly, in the lattice-of-demons algorithm the demons must be
mixed throughout the lattice. If this is not done energy will be
transferred between sites very slowly, only through the
nearest-neighbor interactions of (6). We use a complete random
permutation of the demons after every lattice update, but more local
methods are also adequate.
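The mixing step is simple to express. The sketch below assumes the demon energies are held in a NumPy array with one entry per lattice site; the function names are hypothetical. It shows a complete random permutation and the spatial temperature estimate corresponding to (9).

```python
import numpy as np

def mix_demons(demons, rng):
    """Randomly permute the demons over the lattice (energy is conserved)."""
    shuffled = rng.permutation(demons.ravel())
    return shuffled.reshape(demons.shape)

def lattice_temperature(demons):
    """Estimate kT as the average demon energy over space."""
    return float(np.mean(demons))
```

Permuting the demons moves energy between distant sites in one step without changing the total demon energy, so the temperature estimate is unaffected by the mixing itself.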

Microcanonical  annealing  has  several  advantages  over  stan¬ 
dard  annealing: 

1. It does not require the evaluation of the transcendental
function $\exp(x)$. Of course, in practice this function can
be stored in a table, but we would like our algorithm to be
suited to a fine-grained cellular automaton with very limited
local memory.

2. It is easily implemented with low-precision integer
arithmetic; again, a significant advantage for simple hardware
implementation.

3. In the Metropolis algorithm a state transition is accepted or
rejected by comparing $\exp(-\Delta E/kT)$ to a random number
drawn from a uniform distribution over [0,1], and these
numbers must be accurate to high precision. The Creutz
algorithm does not require high-quality random numbers.

Experiments indicate that the Creutz method can be
programmed to run an order of magnitude faster than the
conventional Metropolis method for discrete systems [18].

In standard annealing it is not clear how to determine when
the system reaches equilibrium. One can examine fluctuations
in the average energy, which should be of order $1/M^2$ at
equilibrium, but this may require many extra iterations to get adequate
statistics because one does not know in advance what average
value to expect. In microcanonical annealing there is a simpler
way. Let $r_{eq}$ be the ratio of the observed average demon energy
to the standard deviation of the same observed distribution:

$$r_{eq} = \frac{\langle E_D \rangle}{\sigma(E_D)}.$$

At equilibrium $r_{eq} \approx 1$.
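The equilibrium test follows from the fact that an exponential (Boltzmann) distribution has equal mean and standard deviation. A minimal sketch, assuming the demon energies are available as an array and using the hypothetical name `equilibrium_ratio`:

```python
import numpy as np

def equilibrium_ratio(demon_energies):
    """Ratio of the mean demon energy to its standard deviation."""
    e = np.asarray(demon_energies, dtype=float)
    sigma = e.std()
    if sigma == 0.0:
        # All demons carry the same energy: far from a Boltzmann distribution.
        return float("inf")
    return float(e.mean() / sigma)
```

Samples drawn from a continuous exponential distribution give a ratio near 1, whereas a uniform distribution on [0, 1] gives a ratio near 1.73. With only a few discrete energy levels available the ratio can depart from 1 even at equilibrium.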

As with the Metropolis algorithm, the Creutz algorithm
converges to the Boltzmann distribution in the limit for any ergodic
process generating candidate state transitions. Of course,
different state-transition schemes will affect the rate of convergence.
We have found the following simple method to be adequate:

$$\Pr(d \to d') = \begin{cases} 0.5 & \text{if } |d - d'| = 1 \\ 0 & \text{otherwise} \end{cases}$$

In other words, the disparities increase or decrease by one lattice
position as the system follows a Brownian path on its
phase-space surface of constant energy. Only one random bit need be
generated for each transition.
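This proposal scheme needs very little machinery, as the following sketch suggests. The function name `propose` is hypothetical, and Python's `random.Random.getrandbits` stands in for whatever one-bit source a hardware implementation would use.

```python
import random

def propose(d, rng):
    """Propose a new disparity one step above or below d, using one random bit."""
    return d + (1 if rng.getrandbits(1) else -1)
```

Every proposal differs from the current disparity by exactly one, and the two directions are equally likely.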

²Statistics can be sampled over both time and space, if desired.


4  Experimental  Results 

This section presents experimental results for three distinct
cases: a sparse random-dot stereogram, a high-resolution aerial
stereo pair, and a medium-resolution, oblique, ground-level
scene with prominent occlusions. The method has been tested
on over 30 real images; these examples have been chosen to
indicate a variety of conditions. Identical parameters were used
for all three cases. Four nearest neighbors were used for $\mathcal{N}$. We
used a value of 50 for $\lambda$, which works well for images quantized
into eight-bit values. A schedule for heating and cooling was
established to yield about 400 complete scans at each scale, with
about 90 percent of the cycle devoted to cooling. The precise
number of scans varies because during cooling the system adapts
the schedule to stay near equilibrium.

Figure 4 shows the results for a 10% random-dot stereogram
with four depth planes separated by intervals of 2 pixels of
disparity. The three graphs trace the evolution of temperature,
average energy per lattice site, and $r_{eq}$. Note that the plot of
$r_{eq}$ indicates that the system moves away from equilibrium
during the relatively fast heating cycles, but relaxes quickly back
to equilibrium after cooling starts. The system appears to drop
away from equilibrium at low temperatures according to the $r_{eq}$
plot, but this effect is actually because there are very few energy
levels available to the demons near the ground state.

Figure 5 shows results from a 512 x 512 aerial stereo pair
with a disparity range of 72 pixels. This is the largest problem
we have attempted so far. The disparity map, shown in the
lower left, is also shown to the right with contours at every
fifth disparity level. Inspection of this map indicates that it is
accurate to 1 pixel over the entire field.

The stereo pair in Figure 6 is distinguished by sharp
discontinuities of depth. (Contours are shown at every third disparity
level.) The method has done a reasonably good job of
separating the tree from the background, which is somewhat surprising
since it has no explicit representation for occlusions. In an
attempt to model occlusions we have experimented with line
processes, like the one used by Geman and Geman [12], and with
non-linear "springs" that weaken as they deform. Our results so
far have not justified the added complexity.

In earlier work [13,19] we used the absolute difference instead
of the squared difference in (3). This is slightly more efficient to
implement and makes little difference in the results. In terms of
the spring model, this would correspond to using "springs" that
exert a constant force in the direction opposite to their deformation.
Of course, a different value of $\lambda$ is required. We have also
experimented with eight-neighbor versions of $\mathcal{N}$, but no significant
improvement in performance was observed.

5  Conclusions 

The major conclusion we can draw is that (1) is an adequate
criterion for stereo matching, even in scenes with abrupt
occlusions. The results in Figure 6 indicate that solutions can
accommodate very abrupt changes in depth. Residual energy
in the near-optimal states is concentrated along steep disparity
contours rather than spread over large areas. The cyclopean
representation ensures that there will be no "unseen" areas with
undefined disparity.

The use of a scale hierarchy dramatically increases the
efficiency of the method, especially for large problems such as that
illustrated in Figure 5. An additional benefit of using a scale
hierarchy is that the solution is less sensitive to small amounts
of vertical disparity, which is eliminated at coarser scales.
(Uncertainty in the camera model will usually cause some vertical
disparity in high-resolution images.) A gaussian low-pass
hierarchy works as well as the laplacian hierarchy if the images
are recorded with equivalent sensors. The benefit of bandpass
filtering is to eliminate the low-frequency variation caused by
uncalibrated photometry.

Annealing provides a way to bridge the gap between scales.
The  microcanonical  annealing  algorithm  appears  to  be  an  im¬ 
provement  over  canonical  annealing  for  reasons  discussed  in  Sec¬ 
tion  3.3.  It  is  certainly  much  easier  to  implement  in  cellular 
automata.  The  theoretical  results  showing  convergence  in  finite 
time  do  not  necessarily  carry  over  to  microcanonical  annealing, 
but  the  requirements  of  these  results  are  never  met  in  practice 
anyway. 

Canonical annealing and "pure" single-demon microcanonical
annealing are at opposite ends of a spectrum. In canonical
annealing the heat bath is much larger than the model system, and
is not represented explicitly. In pure microcanonical annealing
the heat bath (that is, the single demon) is much smaller
than the system, and it is represented explicitly. The
lattice-of-demons algorithm is midway between these extremes, with the
heat bath and the model system having comparable sizes. In a
sense, this is a classical space/time tradeoff. By representing the
heat bath explicitly we can avoid the evaluation of complicated
functions.

Comparison with the scale-space continuation method is
difficult because the nature of the data affects the smoothness of
the energy landscape. In some stereo pairs the data will be so
clear that this more direct form of optimization will work well.
(For example, dense, random, greyscale stereograms with
intensities chosen over a broad range of values can be solved even
by a "greedy" algorithm; that is, a Monte Carlo optimization
accepting only transitions that lower energy.) Both methods
use multi-scale representations, but the stochastic approach uses
multiple scales for efficiency, while the gradient descent method
uses them in an attempt to impose convexity. A comparative
study is needed to determine when the additional overhead of
annealing is justified.
ABSTRACT

This paper concentrates on the problem of obtaining depth
information from binocular disparities. It is motivated by the fact that
implementing registration algorithms and using the results for depth
computations is hard in practice with real images due to noise and
quantization errors. We will show that qualitative depth information
can be obtained from stereo disparities with almost no computations,
and with no prior knowledge (or computation) of camera parameters.
The only constraint is that the epipolar plane of the fixation point
includes the X-axes of both cameras. We derive two expressions which
order all matched points in the images in two distinct depth-consistent
fashions from image coordinates only. One is a tilt-related order $\lambda$,
which depends only on the polar angles of the matched points, the
other is a depth-related order $\chi$. Using $\lambda$ for tilt estimation and point
separation (in depth) demonstrates some anomalies and unusual
characteristics which have been observed in psychophysical experiments.
Furthermore, the same approach can be applied to estimate some
qualitative behavior of the normal to the surface of any object in the
field of view. More specifically, one can follow changes in the
curvature of a contour on the surface of an object, with either x- or
y-coordinate fixed.


INTRODUCTION 

Research in early vision regarding stereo seems to be concerned
mainly with the correspondence problem, namely, finding the
right matching between points on the left and right images.
Obtaining exact depth values from a stereo pair has been considered
a simple exercise, whose solution is well known, though it might
involve some tedious but trivial computations. Thus, it has been
implicitly assumed that the final goal of stereo algorithms is to
compute an exact depth map using disparity values. The following
observations suggest, however, that depth computation from
disparity values is not necessarily straightforward or even feasible,
and that more qualitative depth information may be easier to
obtain and more robust.

First, the depth computation problem reduces to a simple
trigonometric formula when the parameters of the cameras, or
the eyes, are known. When they are not known, a scheme to
compute the cameras' parameters from a number of conjugate points
(that is, matched pairs of points from the different images) has
been devised, involving the solution of a set of nonlinear
equations (see for instance Horn, 1986). Since the problem has no
closed form solution, and since the data are not precise, a
solution is found using iterative methods that minimize the sum of
the squares of the errors. In practice, however, this approach is
very difficult to implement, since the parameters of the cameras
must be obtained from data with error in the order of
magnitude of the disparity values, which are the raw material used
for depth computation (e.g., error due to pixel quantization). In
other words, the registration problem (namely, finding the
parameters of camera calibration) is much more difficult than just
computing depth from disparity values. Less general methods
to find camera calibration have also been devised, see Prazdny
(1981) and Longuet-Higgins (1981).

The other observation originates from biological vision. It
seems  that  human  vision  does  not  necessarily  obtain  exact  depth 
values  from  stereo  disparity  information  alone,  see,  e.g.,  Foley 
(1977)  and  Foley  &  Richards  (1972).  Rather,  stereo  disparity 
seems  to  be  used  mainly  in  obtaining  qualitative  depth  informa¬ 
tion  about  objects  in  the  field  of  view.  Estimation  of  the  magni¬ 
tude  of  this  relative  depth  is  possibly  dependent  on  extraretinal 
estimation  of  some  physical  parameters  like  angle  of  convergence 
of  the  eyes.  For  example,  whether  looking  at  stereograms  with 
crossed  or  uncrossed  eyes  affects  only  the  extraretinal  percep¬ 
tion  of  the  angle  of  convergence  of  the  eyes,  not  the  disparity 
values.  It  also  results  in  a  different  perception  of  the  depth  of 
the  central  square  in  a  simple  random-dot  stereogram  (where  a 
central  square  in  one  image  has  a  constant  shift  with  respect  to 
the  other). 

The purpose of this paper is to exploit the geometry of the
situation, where a scene is viewed from two different angles, to
get insight into the above problem. It will be shown that
qualitative relative depth information (order) of various kinds can be
obtained from conjugate points alone easily and reliably,
involving almost no computations and independent of the cameras'
parameters. Those orders will demonstrate some anomalies which
are observed in human psychophysics and presently lack other
straightforward explanation.

The exact order expressions will scale in proportion to the
angle of convergence between the two cameras. The exact relative
depth can be computed from those orders using a few matched
points and some approximate numeric scheme, or using more
than two images. Alternatively, it can be estimated from some
external estimation of the physical quantities involved, namely
the angle of convergence and the angle of gaze, in agreement
with the psychophysical theory suggested in Foley (1977).

indicate whether $P$, relative to the fixation point, is tilted away
from the cameras' baseline ($\alpha > 0^\circ$), "parallel" to the baseline
($\alpha \approx 0^\circ$), or towards the cameras' baseline ($\alpha < 0^\circ$).

The order expression $\lambda$ has been defined as a function of the
polar angle $\vartheta$ only in both eyes. This is especially convenient
since the polar angle is preserved under projection onto either a
spherical body (the eye) or a planar body (a camera). However,
it might prove useful to examine $\lambda$ as a function of the Cartesian
coordinates $(x_l, y_l)$ and $(x_r, y_r)$ in both images, assuming planar
projection. In this case:

$$\ln\lambda = \ln\frac{\cot\vartheta_r}{\cot\vartheta_l} = \ln\frac{x_r/x_l}{y_r/y_l} = (\ln x_r - \ln x_l) - (\ln y_r - \ln y_l) = \Delta(\ln x) - \Delta(\ln y).$$

In other words, if any matching algorithm is applied to the
output images of the transformation $T : (x, y) \to (\ln x, \ln y)$
performed on the original images, and the disparity vector $(\Delta_x, \Delta_y)$
is then computed in the usual way, then the difference $\Delta_x - \Delta_y = \ln\lambda$
is an order of the same type as $\lambda$, with no need for any
additional computation. However, the transformation $T$ is singular
near the vertical and horizontal meridians.
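The equivalence of the two forms of the order can be checked numerically. The following Python sketch is only an illustration: the function names and the sample coordinates are hypothetical, and all image coordinates are assumed positive so that the logarithms exist (the transformation is singular near the meridians, as noted above).

```python
import math

def log_lambda_from_angles(xl, yl, xr, yr):
    """ln(lambda) from the polar angles, using cot(theta) = x / y in each image."""
    cot_r = xr / yr
    cot_l = xl / yl
    return math.log(cot_r / cot_l)

def log_lambda_from_log_disparity(xl, yl, xr, yr):
    """ln(lambda) as the difference of the disparities measured in log coordinates."""
    d_ln_x = math.log(xr) - math.log(xl)
    d_ln_y = math.log(yr) - math.log(yl)
    return d_ln_x - d_ln_y

# Made-up conjugate points: ((x_l, y_l), (x_r, y_r))
pairs = [((2.0, 1.0), (2.4, 1.1)), ((0.5, 3.0), (0.45, 2.9))]
for (xl, yl), (xr, yr) in pairs:
    print(log_lambda_from_angles(xl, yl, xr, yr),
          log_lambda_from_log_disparity(xl, yl, xr, yr))
```

Both columns printed for a given pair should agree up to rounding, which is the point of working in log coordinates: an ordinary disparity computation yields the order directly.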

An interesting feature of the order $\lambda$ is its good agreement
with some characteristics of human vision, especially under the
unrealistic conditions of the “induced-effect” which is predicted
by using $\lambda$ (see Weinshall, 1987).

DEPTH-RELATED  ORDER 

From equation (2) one can obtain an explicit expression for the
depth $z$ of a point relative to the fixation point (the origin).
First, note that (2) implies

$$z = (\cot\vartheta_r - \cot\vartheta_l)\cdot\frac{y}{2\sin\mu}. \qquad (5)$$

Thus, $\chi_\vartheta = (\cot\vartheta_r - \cot\vartheta_l)$ gives exact depth order on all the
points in space with some constant height $y$ over the base plane.
It follows that this order is most useful to compare points which
differ mainly in their $x$-coordinate with $y$ approximately the
same. The previous order $\lambda$, on the other hand, was most useful
to compare points which differed mainly in their $y$-coordinate
with $x$ approximately the same.
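A small simulation can illustrate this depth-related order. The sketch below is hypothetical: it assumes a simple pinhole model of two cameras converging on a fixation point at the origin (y is the height above the base plane, z the depth along the bisector), with symmetric gaze so that both cameras are at the same distance `d`. The names `project` and `chi_theta` are not from the paper.

```python
import math

def project(point, mu, d, h, side):
    """Pinhole projection for one of two converging cameras.

    side = +1 for the right camera, -1 for the left one (assumed convention).
    mu is half the angle of convergence, d the distance to the fixation
    point, h the focal length.
    """
    x, y, z = point
    lateral = x * math.cos(mu) + side * z * math.sin(mu)
    depth = d - side * x * math.sin(mu) + z * math.cos(mu)
    return h * lateral / depth, h * y / depth

def chi_theta(point, mu, d, h):
    """cot(theta_r) - cot(theta_l), using cot(theta) = x_image / y_image."""
    xr, yr = project(point, mu, d, h, +1)
    xl, yl = project(point, mu, d, h, -1)
    return xr / yr - xl / yl

mu, d, h = math.radians(3.0), 100.0, 1.0
height = 5.0                                   # constant height over the base plane
depths = [-10.0, -2.0, 0.0, 3.0, 12.0]
points = [(4.0, height, z) for z in depths]
values = [chi_theta(p, mu, d, h) for p in points]
# Under this model the order scales to depth by y / (2 sin mu).
recovered_z = [v * height / (2.0 * math.sin(mu)) for v in values]
print(values)
print(recovered_z)
```

In this model the order is monotonic in depth for points at equal height, and rescaling it by y / (2 sin mu) returns the depths that were put in.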

Next, let us derive an expression for $z$ which depends only
on scene and camera parameters. In the appendix it is shown
that, for $\nu$ the angle of gaze, $I$ the interocular distance, and $h$
the focal length of the cameras,

$$y = \left(\frac{I\cos(\mu-\nu)}{\sin 2\mu} - x\sin\mu + z\cos\mu\right)\cdot\frac{y_r}{h}, \qquad (6)$$

$$y = \left(\frac{I\cos(\mu+\nu)}{\sin 2\mu} + x\sin\mu + z\cos\mu\right)\cdot\frac{y_l}{h}. \qquad (6')$$

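Substituting (6) in (5) for a point in the right hemifield ((6')
will be used otherwise) gives:

$$z = \frac{I\cos(\mu-\nu)}{\sin 2\mu}\cdot\frac{x_r - \frac{y_r}{y_l}x_l}{\sin\mu\left[2h + \tan\mu\left(x_r + \frac{y_r}{y_l}x_l\right)\right] - \left(x_r - \frac{y_r}{y_l}x_l\right)\cos\mu}.$$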
Thus, for an angle of convergence $2\mu$ small enough so that
$2h \gg \left|\tan\mu\left(x_r + \frac{y_r}{y_l}x_l\right)\right|$, we get a relative depth order $\chi$ on all the
points in the visual field, where

$$\chi = x_r - \frac{y_r}{y_l}\,x_l.$$

As will be shown in the section of error analysis, $\frac{y_r}{y_l} = 1 + O(\mu)$.
Likewise, since the field of view is mechanically bounded
by some $2\xi < 180^\circ$, it follows that $x < h\tan\xi$. Thus, a sufficient
condition for the appropriateness of $\chi$, to a first order in $\mu$, is
$1 \gg \tan\mu\cdot\tan\xi$. If $2\xi < 90^\circ$, which is a reasonable upper bound,
then it is sufficient if $1 \gg \tan\mu$, or $2\mu \ll 90^\circ$. To illustrate, the
distance to the point of fixation should be much greater than 3
cm for an average person looking straight ahead, possibly 30 cm
or more.
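The behaviour of this order over the whole visual field can be sketched numerically. The code below is an illustration under assumptions, not the paper's implementation: it reuses a simple pinhole model of two symmetric converging cameras (fixation point at the origin, small convergence angle), and the helper names and sample points are hypothetical.

```python
import math

def project(point, mu, d, h, side):
    """Pinhole projection; side = +1 right camera, -1 left camera (assumed model)."""
    x, y, z = point
    lateral = x * math.cos(mu) + side * z * math.sin(mu)
    depth = d - side * x * math.sin(mu) + z * math.cos(mu)
    return h * lateral / depth, h * y / depth

def chi(point, mu, d, h):
    """The order x_r - (y_r / y_l) * x_l, computed from image coordinates only."""
    xr, yr = project(point, mu, d, h, +1)
    xl, yl = project(point, mu, d, h, -1)
    return xr - (yr / yl) * xl

mu, d, h = math.radians(3.0), 100.0, 1.0       # small convergence angle
points = [(-8.0, 3.0, -15.0), (5.0, 9.0, -4.0), (0.0, 6.0, 2.0),
          (7.0, 2.0, 9.0), (-3.0, 12.0, 20.0)]  # differ in x, y and z
by_chi = sorted(points, key=lambda p: chi(p, mu, d, h))
by_depth = sorted(points, key=lambda p: p[2])
print(by_chi == by_depth)
```

For these sample points, which differ in height and horizontal position as well as in depth, sorting by the order reproduces the sorting by depth.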

We have obtained, then, a relative depth order which is the
traditional x-disparity corrected for non-zero vergence (angle of gaze
$\nu$ not 0) and some field location ($x$-coordinate) distortion.
However, for a fixed convergence angle $2\mu$, this order has some
distortion relative to the physical relative depth, which increases
with the horizontal distance from the point of fixation (the $x$
coordinate).

QUALITATIVE  SHAPE  FROM  STEREO 

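The triple $(\alpha, \beta, z)$ as point representation, and equations (1)
and (2), turn out to be useful for surface normal analysis. For
any two points $P_1$ and $P_2$, where $P_1 = z_1\left(\frac{1}{\tan\alpha_1}, \frac{1}{\tan\beta_1}, 1\right)$ and
$P_2 = z_2\left(\frac{1}{\tan\alpha_2}, \frac{1}{\tan\beta_2}, 1\right)$, let $N = P_1 \times P_2$. $N$ is perpendicular
to $(P_1 - P_2)$. (It is actually proportional to the normal to the
plane passing through $P_1$, $P_2$, and the fixation point.) After
some calculations, it can be shown that

$$N \propto \left(\frac{\cot\beta_1 - \cot\beta_2}{\cot\alpha_1\cot\beta_2 - \cot\alpha_2\cot\beta_1},\ -\frac{\cot\alpha_1 - \cot\alpha_2}{\cot\alpha_1\cot\beta_2 - \cot\alpha_2\cot\beta_1},\ 1\right)$$

$$\propto \left(-\frac{1}{\tan\mu}\,f(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2),\ \frac{1}{\sin\mu}\,g(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2),\ 1\right),$$

where

$$f(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2) = \frac{(\cot\vartheta_r^1 - \cot\vartheta_r^2) - (\cot\vartheta_l^1 - \cot\vartheta_l^2)}{(\cot\vartheta_r^1 - \cot\vartheta_r^2) + (\cot\vartheta_l^1 - \cot\vartheta_l^2)},$$

$$g(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2) = \frac{\cot\vartheta_r^1\cot\vartheta_l^2 - \cot\vartheta_l^1\cot\vartheta_r^2}{(\cot\vartheta_r^1 - \cot\vartheta_r^2) + (\cot\vartheta_l^1 - \cot\vartheta_l^2)}.$$

Thus, as long as $f(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2)$ and $g(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2)$ remain
constant, which can be determined from image coordinates only,
the points are coplanar (among themselves and with the fixation
point), or the object at the center of gaze is planar. Note that
$\lambda$ is obtained from $f$ when $\cot\vartheta_r^2 = \cot\vartheta_l^2 = 0$ ($x = 0$ then), in
which case $f = (\lambda - 1)/(\lambda + 1)$.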
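The planarity test can be illustrated with a short simulation. This is a sketch under assumptions, not the paper's code: the cotangents of the polar angles are generated from a simple converging-camera model (fixation point at the origin, half convergence angle `mu`), the two ratios are written out as they follow from that model, and the names `cots` and `f_and_g` are hypothetical. In practice the cotangents would be measured as x/y in each image.

```python
import math

def cots(point, mu):
    """(cot(theta_r), cot(theta_l)) of a scene point under the assumed camera model."""
    x, y, z = point
    return ((x * math.cos(mu) + z * math.sin(mu)) / y,
            (x * math.cos(mu) - z * math.sin(mu)) / y)

def f_and_g(p1, p2, mu):
    """The two ratios of polar-angle cotangents for a pair of points."""
    r1, l1 = cots(p1, mu)
    r2, l2 = cots(p2, mu)
    denom = (r1 - r2) + (l1 - l2)
    f = ((r1 - r2) - (l1 - l2)) / denom
    g = (r1 * l2 - l1 * r2) / denom
    return f, g

mu = math.radians(4.0)
a, b = 0.6, -0.3                       # plane z = a*x + b*y through the fixation point
xy = [(1.0, 2.0), (3.0, 1.0), (-2.0, 4.0), (5.0, 3.0)]
pts = [(x, y, a * x + b * y) for x, y in xy]
pairs = [f_and_g(pts[i], pts[j], mu)
         for i in range(len(pts)) for j in range(i + 1, len(pts))]
for f, g in pairs:
    print(round(f, 6), round(g, 6))
```

Every pair of points on the plane gives the same two values (in this model, a·tan(mu) and −b·sin(mu)), so constancy of the two ratios signals coplanarity with the fixation point without knowing the convergence angle.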

Moreover, for any object it is possible to obtain qualitative
information about its surface along any contour, with either $x$
or $y$ fixed. Take a contour on the surface with some fixed $y$
coordinate, and let $P_1$ and $P_2$ be two points on it. Since the
$y$ coordinate of $P_1 - P_2$ is 0, the projection of $N$ on the $X$-$Z$
plane, $n = \left(-\frac{1}{\tan\mu}\,f(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2),\ 1\right)$, is perpendicular to the
projection of $P_1 - P_2$. Thus, for fixed $\mu$, the one dimensional
boundary contour is convex when $f(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2)$ increases with
increasing $x$, concave when $f(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2)$ decreases, and
linear when $f(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2)$ remains constant. Note that $\chi_\vartheta$ can be
obtained from $f$ since the sign of $f$ determines relative depth
between two points with fixed $y$ coordinate. The same qualitative
description can be obtained for any boundary contour with fixed
$x$ from following $g(\vartheta_l^1,\vartheta_r^1,\vartheta_l^2,\vartheta_r^2)$ with increasing $y$, $x$ fixed. This
qualitative description depends on image coordinates only, more
specifically on polar angles of the conjugate points. Obtaining
this description is not trivial, though, since such a contour in the
world coordinate system will be usually mapped to an oblique
line in the image plane due to convergence.

In the general case, the normal to a plane passing through
three points in space $P_1$, $P_2$ and $P_3$ depends on the image
coordinates and the angle of convergence $\mu$ in a more complex way,
so that $\mu$ should be known to compute the (exact) 3-D
normal. (The focal length $h$ should be known as well.) However, if
$P_1 - P_2 = (0, y, z)$ and $P_1 - P_3 = (x, 0, z)$ or vice versa, the
normal can be computed from the above argument (up to a scaling
factor of the $x$ and $y$-coordinates, depending on $\mu$).

Alternatively, one can estimate the normal to the plane
passing between three points $P_1$, $P_2$, and $P_3$ to a first order
in $\mu$ and the x-disparity. In this case, after substituting
$K_{\mu,\nu,h}\cdot\left(x_r - \frac{y_r}{y_l}x_l\right)$ as an approximation for $z$, where $K$ is some
constant which depends on $\mu$, $\nu$, and $h$, one gets an expression
for the general normal $N_{1,2,3}$:

$$N_{1,2,3} = (P_1 - P_2) \times (P_1 - P_3) \propto \left(W_{\mu,\nu,h}\,F(x_l^i, y_l^i, x_r^i, y_r^i),\ V_{\mu,\nu,h}\,G(x_l^i, y_l^i, x_r^i, y_r^i),\ 1\right), \quad i = 1, 2, 3,$$

where $W_{\mu,\nu,h}$ and $V_{\mu,\nu,h}$ are some constants which depend
on $\mu$, $\nu$ and $h$. $F()$ and $G()$ are some functions of image
coordinates only. Once again, one can verify planarity of surfaces of
objects in the field of view when $F$ and $G$ remain constant.


NUMERICAL COMPUTATION

Let us compute the exact tilt and depth value to a first order in
the convergence angle $2\mu$, following the method of Mayhew &
Longuet-Higgins (1982) to compute tilt and slant of a plane through the
fixation point. The following scheme, however, will be simpler
and involve fewer and more rigorous assumptions (we shall only
assume small $2\mu$ as implied above). Since, to a first order in $\mu$,
$\tan 2\mu \approx \frac{I\cos\nu}{H}$, where $H$ is the distance between the fixation
point and the midpoint of the interocular line (the nose), our
computations will be to a first order in $\frac{I}{H}$.

Let $(x, y)$ and $(x', y')$ denote the image coordinates of a
certain point in space on the two cameras respectively. Let $a$ and
$b$ denote the parameters of a plane that passes through a given
point in space and the fixation point in the above coordinate
system, so that $Z = aX + bY$. Thus $a$ is $\tan(\alpha)$ in the previous
notations if $b = 0$ and $b$ is $\tan(\beta)$ if $a = 0$.

Then, to a first order in $\mu$, we have (Longuet-Higgins &
Prazdny, 1980)

$$\Delta x = x' - x = \left[(a\cos\nu + \sin\nu)\,x + b\cos\nu\,y + (\cos\nu - a\sin\nu)\,x^2 - b\sin\nu\,xy\right]\cdot\frac{I}{H},$$

$$\Delta y = y' - y = \left[\sin\nu\,y + (\cos\nu - a\sin\nu)\,xy - b\sin\nu\,y^2\right]\cdot\frac{I}{H}. \qquad (7)$$

(The coordinate system used to obtain (7) is the cyclopean
coordinate system. This, however, doesn't change the results when
changing to our coordinate system since the angle bisector and
the median are the same line to a first order in $\mu$ and the
translation of the origin has been taken into account in the definition
of the target plane).

For a given point in space, one can, in the more interesting
cases, take the plane passing through it and the fixation point
which is perpendicular to the base plane, for which $b = 0$. This
plane would be determined only by $a$ (the plane perpendicular
to the base plane is usually unique). Thus, we have

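$$\frac{\Delta y}{y} = \left[\sin\nu + (\cos\nu - a\sin\nu)\,x\right]\cdot\frac{I}{H} = \left[\tan\nu + (1 - a\tan\nu)\,x\right]\cdot\tan 2\mu. \qquad (8)$$

Let $(x_1, y_1, \Delta x_1, \Delta y_1)$ be the coordinates of a point on the
vertical meridian, so that $x_1 \approx 0$. Then we have

$$\frac{\Delta y_1}{y_1} = \tan\nu\cdot\tan 2\mu.$$

(Recall that $\frac{1 - y_r/y_l}{1 + y_r/y_l} = \tan\nu\cdot\tan\mu$ on the vertical meridian always.)

Let $(x_2, y_2, \Delta x_2, \Delta y_2)$ be the coordinates of a point with
$a \approx 0$. Such a point, if it exists, can be easily identified since it
satisfies $\frac{\Delta x_2}{x_2} = \frac{\Delta y_2}{y_2}$. Then we have:

$$x_2\cdot\tan 2\mu = \frac{\Delta y_2}{y_2} - \frac{\Delta y_1}{y_1}.$$

In other words,

$$\tan\nu\cdot\tan 2\mu = \frac{\Delta y_1}{y_1}, \qquad \tan 2\mu = \frac{1}{x_2}\left[\frac{\Delta y_2}{y_2} - \frac{\Delta y_1}{y_1}\right].$$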

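Now, for any point $(x, y)$ in the image we have, using (7) with
$b = 0$:

$$\frac{\Delta x}{x} - \frac{\Delta y}{y} = a\cos\nu\cdot\frac{I}{H} = a\cdot\tan 2\mu.$$

This leads to the final equations:

$$\tan 2\mu = \frac{1}{x_2}\left[\frac{\Delta y_2}{y_2} - \frac{\Delta y_1}{y_1}\right], \qquad \tan\nu = \frac{\Delta y_1/y_1}{\tan 2\mu}, \qquad a = \frac{\Delta x/x - \Delta y/y}{\tan 2\mu}.$$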
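The scheme above can be exercised on synthetic data. The sketch below is an illustration under assumptions: disparities are generated from the first-order model of (7) with the second plane parameter set to zero, `k` stands for the ratio I/H, and the function names (`disparity`, `recover`) and sample values are hypothetical.

```python
import math

def disparity(x, y, a, nu, k):
    """First-order disparities (dx, dy) for a point on the plane Z = a*X; k = I/H."""
    c = math.cos(nu) - a * math.sin(nu)
    dx = ((a * math.cos(nu) + math.sin(nu)) * x + c * x * x) * k
    dy = (math.sin(nu) * y + c * x * y) * k
    return dx, dy

def recover(p1, p2, p):
    """Recover tan(2 mu), tan(nu) and the tilt parameter a.

    p1: vertical-meridian point (x ~ 0); p2: point whose plane has a ~ 0;
    p: the point of interest. Each is a tuple (x, y, dx, dy).
    """
    r1 = p1[3] / p1[1]                       # equals tan(nu) * tan(2 mu)
    tan_2mu = (p2[3] / p2[1] - r1) / p2[0]
    tan_nu = r1 / tan_2mu
    x, y, dx, dy = p
    a = (dx / x - dy / y) / tan_2mu
    return tan_2mu, tan_nu, a

nu, k, a_true = 0.2, 0.06, 0.7
p1 = (0.0, 0.5) + disparity(0.0, 0.5, a_true, nu, k)   # tilt is irrelevant at x = 0
p2 = (0.3, 0.4) + disparity(0.3, 0.4, 0.0, nu, k)      # a point with a = 0
p = (0.2, -0.3) + disparity(0.2, -0.3, a_true, nu, k)
tan_2mu, tan_nu, a = recover(p1, p2, p)
print(tan_2mu, tan_nu, a)
```

With noise-free input the three recovered values match the ones used to generate the data; the sketch says nothing about behaviour under matching noise, which the error analysis addresses separately.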


The ratio near the vertical meridian is relatively reliable
and easy to obtain. However, a point with $a \approx 0$ does not
necessarily exist, in which case we can:

1. Follow Mayhew & Longuet-Higgins (1982) and neglect the
term $a\tan(\nu)$, but not the term $\tan(\nu)$:

$$\tan 2\mu = \frac{1}{x}\left[\frac{\Delta y}{y} - \frac{\Delta y_1}{y_1}\right].$$

If we neglect $\tan(\nu)$, for consistency, we get:

$$\tan 2\mu = \frac{1}{x}\cdot\frac{\Delta y}{y}.$$

2. Solve the initial scheme without such a point. Given a
vertical-meridian point, there remains a fairly simple equation to
solve. This would be:

$$\tan 2\mu - a\cdot\frac{\Delta y_1}{y_1} = \frac{1}{x}\left[\frac{\Delta y}{y} - \frac{\Delta y_1}{y_1}\right], \qquad (10)$$

$$a\cdot\tan 2\mu = \frac{\Delta x}{x} - \frac{\Delta y}{y},$$

which reduces, after substituting $\tan(2\mu)$ from the second
equation in the first equation, to a second-degree polynomial in $a$.

A different numerical approach would be to use, for
example, three images taken while moving on the base plane. Denote
by $(x_0, y_0)$, $(x_1, y_1)$ and $(x_2, y_2)$ the coordinates of the conjugate
projections of some point $P$ on the three images. Denote by $\alpha_1$,
$\mu_1$ and $\nu_1$ the angle of tilt, half the angle of convergence, and
the angle of gaze respectively in the coordinate system defined
as above by the first two images. Denote by $\alpha_2$, $\mu_2$ and $\nu_2$ the
same angles in the coordinate system defined by the last two
images (see figure 2). For motion on a straight line we have:


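Figure 2. Three images from motion on the base plane.

$$\tan\alpha_1\cdot\tan\mu_1 = \frac{\lambda(x_0,y_0,x_1,y_1) - 1}{\lambda(x_0,y_0,x_1,y_1) + 1}$$

$$\tan\nu_1\cdot\tan\mu_1 = \frac{1 - \frac{y_1(x=0)}{y_0(x=0)}}{1 + \frac{y_1(x=0)}{y_0(x=0)}}$$

$$\tan\alpha_2\cdot\tan\mu_2 = \frac{\lambda(x_1,y_1,x_2,y_2) - 1}{\lambda(x_1,y_1,x_2,y_2) + 1}$$

$$\tan\nu_2\cdot\tan\mu_2 = \frac{1 - \frac{y_2(x=0)}{y_1(x=0)}}{1 + \frac{y_2(x=0)}{y_1(x=0)}}$$

$$\nu_2 - \nu_1 = \mu_1 + \mu_2,$$

where $\frac{y_j(x=0)}{y_i(x=0)}$ is the Y-axis ratio on the vertical meridian ($x = 0$)
between conjugate points in images $i$ and $j$.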

Thus we have six nonlinear equations with six unknowns.
For small $\mu$'s we have approximately a linear problem, where
the solution is a null-vector of the approximating matrix. We
will obtain a very similar set of equations if we take two points
in the three images and ignore the equations involving the angle
of gaze $\nu$. In this case the motion in the base plane does not
have to be in a straight line.

ERROR  ANALYSIS 

First, from the definition of $\lambda$ and $\chi$ it follows that the base plane
itself is singular in the sense that these orders are not defined for
points on it. The same problem exists in the analysis of normals
to surfaces of objects. One can, however, estimate the orders
and normals by substituting $\frac{y_r}{y_l}$ of a matched point far from the
base plane. More specifically, for $P = (x, y, z)$ we have

$$\frac{y_r}{y_l} = \frac{d_l + x\sin\mu + z\cos\mu}{d_r - x\sin\mu + z\cos\mu}$$

$$= 1 - \frac{2\tan\mu\tan\nu}{1+\tan\mu\tan\nu} + \frac{z}{d_r}\cdot\frac{2\sin\mu\tan\nu}{1+\tan\mu\tan\nu} + \frac{x}{d_r}\cdot\frac{2\sin\mu}{1+\tan\mu\tan\nu} + O\!\left(\frac{x^2}{d_r^2}, \frac{z^2}{d_r^2}\right).$$

Thus, if point $P^j$ is used to approximate point $P^i$, the error will
be:

$$\left(\frac{y_r}{y_l}\right)^i - \left(\frac{y_r}{y_l}\right)^j \approx \frac{2\sin\mu}{1+\tan\mu\tan\nu}\left[\tan\nu\cdot\frac{z^i - z^j}{d_r} + \frac{x^i - x^j}{d_r}\right].$$

The error is 0 if the approximating point $P^j$ lies exactly “above”
$P^i$ (differs only in the $y$-coordinate).
Moreover, one can use as an estimation $\frac{y_r}{y_l} \approx 1 - \frac{2\tan\mu\tan\nu}{1+\tan\mu\tan\nu}$ (the first
two terms), so that some (possibly extraretinal) estimation of $\mu$
(half the angle of convergence) and $\nu$ (the angle of gaze) will
suffice to give a rough estimation to $\frac{y_r}{y_l}$ when no other source of
information is available. Note that one can't take $\frac{y_r}{y_l} \approx 1$ when
computing $x_r - \frac{y_r}{y_l}x_l$ as a first order approximation in $\mu$, since
$x_r - x_l$ is of the order of magnitude of $\mu$ also.

Second, let us consider the violation of the basic
assumption, namely, that the $X$-axes themselves of both cameras are
epipolar lines. This introduces an error $\delta_r$ and $\delta_l$ in the polar
angle of a given point's projections on the right and left images
respectively. Thus, the true orders should be:

$$\lambda_t = \frac{\cot(\vartheta_r+\delta_r)}{\cot(\vartheta_l+\delta_l)} = \frac{\cot\vartheta_r}{\cot\vartheta_l} + \frac{\cot\vartheta_r}{\cos^2\vartheta_l}\cdot\delta_l - \frac{\tan\vartheta_l}{\sin^2\vartheta_r}\cdot\delta_r + o(\delta),$$

$$\chi_{\vartheta t} = \cot(\vartheta_r+\delta_r) - \cot(\vartheta_l+\delta_l) = (\cot\vartheta_r - \cot\vartheta_l) - \frac{1}{\sin^2\vartheta_r}\cdot\delta_r + \frac{1}{\sin^2\vartheta_l}\cdot\delta_l + o(\delta),$$

where $\delta_r, \delta_l < \delta$. The main conclusion from this is that the effect
of axes misalignment is greater near the horizontal and vertical
meridians, and possibly negligible further away. Also, this error
affects less the expressions for qualitative shape (the normal to
iso-$x$ or iso-$y$ surface contours), since they involve differences
where this error is somewhat cancelled out for the two points.
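The first-order error terms can be compared against the exactly perturbed orders. The following sketch is a hypothetical illustration: the angle and error values are made up, and the function names are not from the paper.

```python
import math

def cot(t):
    return 1.0 / math.tan(t)

def lambda_true(tr, tl, dr, dl):
    """Tilt-related order with polar-angle errors dr, dl applied."""
    return cot(tr + dr) / cot(tl + dl)

def lambda_first_order(tr, tl, dr, dl):
    """First-order expansion of lambda_true in dr and dl."""
    return (cot(tr) / cot(tl)
            + cot(tr) / math.cos(tl) ** 2 * dl
            - math.tan(tl) / math.sin(tr) ** 2 * dr)

def chi_true(tr, tl, dr, dl):
    """Depth-related order (difference of cotangents) with errors applied."""
    return cot(tr + dr) - cot(tl + dl)

def chi_first_order(tr, tl, dr, dl):
    """First-order expansion of chi_true in dr and dl."""
    return (cot(tr) - cot(tl)
            - dr / math.sin(tr) ** 2
            + dl / math.sin(tl) ** 2)

tr, tl = 0.9, 0.8            # polar angles (radians), away from both meridians
dr, dl = 1e-3, -2e-3         # small misalignment errors
print(lambda_true(tr, tl, dr, dl), lambda_first_order(tr, tl, dr, dl))
print(chi_true(tr, tl, dr, dl), chi_first_order(tr, tl, dr, dl))
```

The coefficients 1/sin² and 1/cos² grow without bound as a polar angle approaches a meridian, which is the numerical counterpart of the conclusion stated above.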

SUMMARY 

The goal of this work has been to obtain qualitative information
from a stereo pair, with as few computations as possible and
minimal dependence on the cameras' and the scene's parameters. We have
shown that points in a stereo pair, once matched to each other,
can be ordered according to two distinct order expressions: a
tilt-related order $\lambda$, which is roughly a relative depth order when
ordering points with only vertical displacement, and a
depth-related order $\chi$, which is best to order points with only horizontal
displacement. These orders are completely determined by image
coordinates of conjugate points; no camera or scene parameters
are needed (which need not be similar for both cameras). $\lambda$ and
some variation of $\chi$ ($\chi_\vartheta$) depend only on the polar angles of the
conjugate points in both images, a quantity which is preserved
under projection to a spherical body (an eye) or a planar body (a
camera). Moreover, given the polar angles of the image
coordinates, some qualitative shape information can be obtained: one
can follow changes in the curvature of a contour on the surface
of any object in the field of view, with either the $x$- or $y$-coordinate
fixed. We demonstrated, by further analyzing the exact
equations, that obtaining the quantitative information is much harder
and less reliable than obtaining the qualitative one, e.g., orders
like $\lambda$ and $\chi$. It usually involves some assumptions on the scene
or extra-retinal information, plus a lot of computations. These
computations tend to be less robust and more sensitive to noise and


APPENDIX

Consider the base plane, which includes the $X$ axes of both
cameras and both their optical axes. This is illustrated in figure 3,
where $O$ is the focal node of one camera, $A$ the fixation point,
$A'$ the projection of $A$ on the image plane or the origin of the
camera coordinate system, $B$ the projection of a given point
in space ($C$) on the base plane, $B'$ the projection of $B$ on the
camera $X$-axis, and $C'$ the projection of the point $C$ on the
image plane. In the base plane, we add the point $D$ which is
the projection of $B$ on the optical axis $AA'$. Let $\varphi$ denote the
angle $\angle BAO$. Let $(x, y)$ denote the coordinates of the projection
of point $C$ on the image, so that $x = A'B'$ and $y = B'C'$. Let
$d$ denote the distance of the fixation point to the eye, so that
$d = AO$, and $h$ denote the focal length of the camera, so that
$h = A'O$. Using similar triangles, one can verify the following:

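$$\frac{BC}{y} = \frac{BO}{B'O} = \frac{DO}{A'O} = \frac{d - AB\cos\varphi}{h}.$$

Figure 3. The base plane viewed from above, with one camera.

In other words,

$$y = \frac{h\,BC}{d - AB\cos\varphi}. \qquad (11)$$

Using the same arguments, we get

$$\frac{AB\sin\varphi}{x} = \frac{DO}{A'O} = \frac{d - AB\cos\varphi}{h}.$$

In other words,

$$x = \frac{h\,AB\sin\varphi}{d - AB\cos\varphi}, \qquad \frac{x}{y} = \frac{AB\sin\varphi}{BC}.$$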

Note that our assumption that the base plane intersects both
cameras' $X$-axes implies that the same geometry holds for both
cameras in the sense that the segments $BC$ and $AB$ are identical
in both cases. ($A$ and $C$ are the same points, and $B$ is identical
since $C$ is projected onto the same plane.) Let us add indices for
the variables of the left and right cameras, $l$ and $r$ respectively.
Then

$$\frac{x_r}{y_r} = \frac{AB\sin\varphi_r}{BC}, \qquad \frac{x_l}{y_l} = \frac{AB\sin\varphi_l}{BC}.$$


From figure 1, in which the angle of gaze $\nu$ is defined, it follows
that $d_r = \frac{I\cos(\mu-\nu)}{\sin 2\mu}$, so that

$$y = \left(\frac{I\cos(\mu-\nu)}{\sin 2\mu} - x\sin\mu + z\cos\mu\right)\cdot\frac{y_r}{h}. \qquad (6)$$

Applying the same argument to the $y$ coordinate of the left
image, we get

$$y = \left(\frac{I\cos(\mu+\nu)}{\sin 2\mu} + x\sin\mu + z\cos\mu\right)\cdot\frac{y_l}{h}. \qquad (6')$$

Since by definition $\tan\alpha = \frac{z}{x}$, we immediately get equation (2).

Let  us  develop  the  expression  for  the  image  coordinate  y 
above,  considering  the  right  image  with  no  loss  of  generality. 
We then have from (11),

Abstract

In  numerous  computer  vision  applications,  there  is  both 
the  need  for  and  the  ability  to  access  multiple  types  of 
information  about  the  three  dimensional  aspects  of  objects  or 
surfaces.  When  this  information  comes  from  different  sources 
the  combination  becomes  non-trivial. 

This  paper  describes  an  approach  that  integrates  multiple 
visual  sensing  methodologies  yielding  three  dimensional 
information.  The  current  system  integrates  feature  based 
stereo  algorithms  with  various  shape-from-texture  algorithms 
(multi-view  shape-from-texture  and  shape-from-motion 
modules  expected  to  be  incorporated  in  the  future).  Unlike 
most  systems  for  multi-sensor  integration,  that  fuse  all  the 
information  at  one  conceptual  level,  e.g.,  the  surface  level,  the 
system  under  development  uses  two  levels  of  data  fusion, 
intra-process  integration  and  inter-process  integration.  The 
paper  discusses  intra-process  integration  techniques  for 
feature-based  stereo  and  shape-from-texture  algorithms.  It 
also  discusses  an  inter  process  integration  technique  based  on 
smooth  models  of  surfaces.  Examples  are  presented  using 
camera  acquired  images. 


1  Introduction 

This paper discusses research into the integration of two
different but highly compatible modalities: multiple
feature-based stereo algorithms, which use different
types of features, and multiple shape-from-texture algorithms,
which utilize different types of textural cues. These
modalities are being considered because they are, in general,
applicable to similar regions of an image. This allows
experimentation with both corroborating information, e.g.,
stereo images of textured three dimensional surfaces, and
conflicting information (not reported here), e.g., stereo images
of a two dimensional image of a textured three dimensional
surface.

While solving the general fusion problem is beyond the
current abilities of AI research, there has been progress in
restricted contexts; these include fusion of stereo and tactile
data [Allen 85; Allen 86], pointwise fusion of data [Henderson
and Fai 83], and fusion of various information about an
intensity image for the purpose of segmentation [Belknap et al.
85; Kohler 84; McKeown et al. 84]. There has also been
work on regularization as a means for fusion [Blake and
Zisserman 86; Medioni and Yasumoto 85; Poggio and Torre

All  of  the  above  mentioned  systems  fuse  their 
information  at  one  conceptual  level,  e.g.,  the  image  or  surface 
level.  In  contrast,  the  system  under  development  uses  two 
levels  of  data  fusion,  intra-process  integration  and  inter¬ 
process  integration.  The  former  consists  of  fusion  of 
information  generated  by  all  shape-from-X  approaches  with 
certain  predetermined  similarities,  e.g.,  feature  based  stereo 
algorithms  with  different  features.  The  latter  type  of 
integration  is  the  fusion  of  the  information  resulting  from  each 
of  the  intra-process  integration  phases,  with  any  a  priori 
knowledge,  e.g.,  smoothness  assumptions  or  model 
assumptions. However, to allow for some amount of top-down
processing and the future addition of world-based
constraints, communication between processes is performed
through a global blackboard.

The  remainder  of  this  paper  is  divided  into  sections  on 
background  and  motivation,  texture  algorithms  and  related 
intra-process  integration,  stereo  algorithms  and  related  intra¬ 
process  integration,  and  inter-process  integration.  Following 
the description of the current system, the results of limited
experimental testing, using camera images, are presented.

2  Motivation  and  Background 

As  research  in  vision  has  progressed,  it  has  been  realized 
that  the  information  available  from  a  single  “shape-from” 
algorithm  would  not  be  sufficient  to  solve  the  general  vision 
problem. Prior vision research has yielded different
“modalities” of information, including numerous approaches
to shape-from-texture [Kender 80; Witkin 80], binocular
stereo [Marr and Poggio 79; Eastman and Waxman 85; Hoff
and Ahuja 85], shape-from-shading [Horn 70; Lee 85], and
shape-from-motion [Anandan and Weiss 85; Prazdny 79]. Each
of these sources of shape information has different domains
of applicability, different computational complexity, and
different error characteristics. For any given module, there
exist numerous images (or regions thereof) for which the
module would not correctly predict surface shape. Some of
the  sources  are  complementary,  e.g.  shape-from-shading  will 
apply  generally  only  in  those  regions  where  shape-from- 
texture  will  fail.  Other  modules  can  act  in  either  a  competitive 
or  synergistic  fashion,  e.g.  binocular  stereo  and  shape-from- 
texture  will  generally  apply  in  the  same  regions  of  an  image, 
and  may  compete  for  dominance  if  their  outputs  differ,  or  can 
mutually  reinforce  a  consistent  interpretation. 


Robustness  could  therefore  be  gained  by  combining 
different  shape-from-X  methods  within  a  single  unified 
system.  In  designing  such  a  system  there  exist  two  basic 
problems: 

1.  Different  modalities  generate  information 
constraints  based  on  different  scale  assumptions. 

If the shape-from methods utilize the same scale
and type of information then all of the
constraints that relate to the same image area can
be integrated to generate a single data element
for that image area, e.g. fusing all of the shape-from-
texture methods. If the methods utilize different
scales and/or different data types then a common
level must be chosen.

2. Generating a single interpretation from the
large number of cooperating and/or conflicting
information sources is predicated on choosing a
“reasonable” confidence weighting for each
constraint. The confidence weighting should be
comprised of two basic components:

• An intra-method weighting that states how
closely the image data fits the shape-from
method’s underlying assumptions.

• An inter-method weighting which assures
that no single method dominates because
it generates substantially more constraints.
Instead, a method should dominate
because its constraints define a single
datum of information whose confidence is
higher than that of another method. Model
driven expectations can also be included
in this weight if there exists a predilection
for one method over another.
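The two weighting components can be illustrated with a small sketch. This is a hypothetical example, not the system's actual weighting scheme: each constraint is assumed to be a unit surface normal with an intra-method confidence in [0, 1], and the inter-method weight is assumed to be the reciprocal of the number of constraints the method produced, so that sheer quantity does not let a method dominate.

```python
import math
from collections import defaultdict

def fuse_normals(constraints):
    """Fuse orientation constraints into one unit normal.

    constraints: list of (method_name, unit_normal, intra_weight) tuples.
    """
    counts = defaultdict(int)
    for method, _, _ in constraints:
        counts[method] += 1
    total = [0.0, 0.0, 0.0]
    for method, normal, intra in constraints:
        weight = intra / counts[method]        # assumed inter-method normalisation
        for i in range(3):
            total[i] += weight * normal[i]
    length = math.sqrt(sum(c * c for c in total))
    if length == 0.0:
        raise ValueError("constraints cancel out; no orientation can be chosen")
    return tuple(c / length for c in total)

constraints = [
    ("texel-spacing", (0.0, 0.0, 1.0), 0.5),   # three constraints from one method
    ("texel-spacing", (0.0, 0.0, 1.0), 0.5),
    ("texel-spacing", (0.0, 0.0, 1.0), 0.5),
    ("texel-size", (1.0, 0.0, 0.0), 0.5),      # a single constraint from another
]
fused = fuse_normals(constraints)
print(fused)
```

Because the first method's three constraints share one unit of inter-method weight, the two methods contribute equally here and the fused normal lies halfway between their proposals.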


3  Integrating  Methodologies 

The two-level fusion methodology decouples data
acquisition and its related assumptions from the final results of
data fusion and surface generation. The lower level
intra-process integration techniques derive orientation information
based on the underlying level of abstraction. The explicit and
implicit assumptions of the underlying intra-shape-from-X
methods are used to weigh the confidence of each orientation
constraint in relationship to constraints generated by related
intra-shape-from-X methods (e.g., constraints from
shape-from-uniform-texel-spacing with constraints from
shape-from-uniform-texel-size). A brief presentation is given of two such
intra-process integration techniques.

Since  the  final  result  of  the  data  fusion  is  assumed  to  be 
objects  with  smooth  surfaces,  the  inter-process  integration 
depends  on  our  model  of  surfaces.  The  current  system 
employs  a  regularization  based  surface  reconstruction 
technique  for  this  task.  This  approach  allows  the  system  to 
independently  weight  each  piece  of  information  from  the 
intra-process  integration  phases.  Part  of  this  weighting  is  a 
global  factor  determining  which  of  the  modalities  has  higher 
priority. 
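The idea of weighting each datum independently inside a regularization framework can be sketched in one dimension. The code below is a generic illustration under stated assumptions, not the reconstruction technique used by the system: it minimises a confidence-weighted data term plus a second-difference smoothness term by solving the normal equations directly, and all names and values are hypothetical.

```python
def reconstruct(data, weights, smoothness):
    """Minimise sum_i w_i (z_i - d_i)^2 + smoothness * sum_i (z_{i-1} - 2 z_i + z_{i+1})^2."""
    n = len(data)
    # Normal equations: (W + smoothness * D^T D) z = W d
    a = [[0.0] * n for _ in range(n)]
    b = [weights[i] * data[i] for i in range(n)]
    for i in range(n):
        a[i][i] += weights[i]
    stencil = (1.0, -2.0, 1.0)
    for k in range(1, n - 1):                 # one second-difference row per interior sample
        idx = (k - 1, k, k + 1)
        for p in range(3):
            for q in range(3):
                a[idx[p]][idx[q]] += smoothness * stencil[p] * stencil[q]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for j in range(col, n):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]
    # Back substitution
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(a[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / a[i][i]
    return z

depth = [0.0, 1.0, 2.0, 99.0, 4.0, 5.0]       # sample 3 is an outlier
confidence = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0]   # and carries zero confidence
surface = reconstruct(depth, confidence, smoothness=10.0)
print([round(v, 6) for v in surface])
```

The zero-confidence outlier is ignored and filled in from its neighbours by the smoothness term, so the reconstructed profile is the straight ramp implied by the trusted samples.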

The interaction among the various computational
modules, as well as the integration modules, is accomplished
with a blackboard organization. This scheme allows
bidirectional flows of information and provides a means for
easy detection of when the information necessary for the
execution of each module is present in the system. This
portion of the system will not be considered further in this
paper.

A  two  level  integration  approach  has  two  additional 
advantages.  The  first  is  computational  in  nature,  and  derives 
from  the  fact  that  it  is  easier  to  heuristically  combine  data 
from  similar  sources,  as  in  the  intra-process  integration  phase. 
Then,  when  the  system  must  integrate  information  from 
markedly  different  sources,  that  information  should  already  be 
of  higher  quality  than  the  initial  raw  data.  This  separation  of 
duties  also  aids  in  maintaining  system  modularity,  and 
minimizes  global  memory  requirements. 

The  second  reason  for  desiring  a  multi-level  integration 
scheme  follows  from  studies  of  human  vision.  Considerable 
research  exists  on  the  human  perception  of  three-dimensional 
surfaces  [julesz-61;  bulthoff-mallot-87].  Many  of  these  works 
have  used  selective  stimuli,  e.g.,  random-dot  stereograms,  to 
study  phenomenological  aspects  of  depth  perception  from 
various  sources.  From  these  studies  one  can  draw  inferences 
about  the  integration  of  information  from  “modules”  using 
this  information.  Other  works  have  studied  the  “ordering”  of 
operations  in  the  human  visual  system  and  the  interaction  of 
various  information  sources  for  depth  perception.  Of  particular 
relevance  to  the  question  of  multi-level  integration  is  the  work 
of  [bulthoff-mallot-87],  which  examined  the  interaction,  both 
rivalrous and mutually supportive, among various
information  sources. 


4  Texture  processes  and  texture  intra-process 
integration 

This section discusses our approach to the problem of
deriving orientation information from multiple independent
textural cues. The method consists of two major phases: the
generation of constraints on the orientation of texel patches,1
and intra-process integration, where the orientation constraints
for each patch are fused into a “most likely” orientation. The
robustness of this approach has already been demonstrated
elsewhere; see [Moerdler and Kender 87a; Moerdler and
Kender 87b; Moerdler 88].

Currently the shape-from-texture methods used are
shape-from-uniform-texel-spacing [Kender 83] and
shape-from-uniform-texel-size [Ohta et al. 81]. These two methods
generate orientation constraints for different overlapping
classes of textures.


4.1  Background 

Current  methods  to  derive  shape-from-texture  are  based 
on  measuring  a  distortion  that  occurs  when  a  textured  surface 
is  viewed  under  perspective,  assuming  of  course  that  natural 
texture  neither  mimics  nor  cancels  projective  effects.  The 
perspective  distortion  results  in  some  aspect  of  the  texture 
being  deformed  when  the  scene  is  imaged.  In  order  to  simplify 
the  recovery  of  the  orientation  parameters  from  this  distortion, 
researchers  have  imposed  limitations  on  the  applicable  class  of 
textured  surfaces.  Some  of  the  limiting  assumptions  include 
uniform texel spacing [Kender 80; Kender 83; Moerdler and
Kender 85], uniform texel size [Ikeuchi 80; Ohta et al. 81],


1 A texel patch is a 2-D description of a sub-image that contains one or
more  textural  elements.  The  number  of  elements  that  compose  a  patch  is 
dependent  on  the  shape-from-texture  algorithm. 




uniform  texel  density  [Aloimonos  86],  and  texel  isotropy 
[Witkin  80;  Dunn  84].  These  are  strong  limitations  causing 
methods  based  on  them  to  be  applicable  to  only  a  limited 
range  of  real  images. 


4.2  Design  Methodology 

The  generation  of  orientation  constraints  from 
perspective  distortion  is  performed  using  one  or  more  image 
texels.  The  orientation  constraints  can  be  considered  as  local, 
defining  the  orientation  of  individual  surface  patches  called 
texel  patches.  Texel  patches  are  defined  by  how  each  method 
utilizes  the  texels.  Some  methods,  e.g.,  uniform  texel  size,  use 
a measured change between two texels; in this case the texel
patches are the texels themselves. Other methods, e.g.,
uniform  texel  density,  use  a  change  between  two  areas  of  the 
image.  In  the  latter  case  the  texel  patches  are  predefined  areas 
of  the  image.  For  the  texture  modules,  intra-process  fusion  is 
carried  out  at  the  texel  patch  level. 

This differs from integration at the surface level, which
has been attempted elsewhere (e.g., [Ikeuchi 80] and
[Aloimonos 86]) and uses constraint propagation and
relaxation to derive a single orientation per surface patch.2

The  process  of  fusing  orientation  constraints  and 
generating  surfaces  can  be  broken  down  into  the  following 
three  phases:  (1)  the  creation  of  texel  patches,  (2)  calculation 
of  (multiple)  orientation  constraints  for  each  texel  patch,  and 
(3)  the  unification  of  the  orientation  constraints  per  texel  patch 
into  a  “most  likely”  orientation.  Each  of  the  remaining 
portions  of  this  subsection  describes  one  of  these  phases. 

4.2.1  Texel  patch  definition 

There  has  been  considerable  work  in  computer  vision  on 
the  automatic  recognition  of  textural  patches.  While  accurate 
and consistent texel patch recovery would greatly simplify the
integration  process,  we  feel  that  such  data  is  unavailable  at  the 
present  time.  Instead,  we  have  chosen  a  simplistic  patch 
definition  obtained  by  first  processing  the  image  with  assorted 
filters  and  then  thresholding  the  image  to  define  patches.3  We 
acknowledge  that  better  texture  discrimination  algorithms 
exist,  but  this  was  not  the  focus  of  our  research.  In  the  work 
described  in  this  paper  (and  described  in  greater  depth  in 
[Moerdler  88])  we  have  filtered  the  image  by  local  averaging 
of  the  gray  levels.  We  have  also  experimented  with  edge 
detection  and  edge  orientation  filters  (with  and  without  post 
filtering  smoothing),  although  those  filters  are  not  used  on  the 
examples  herein. 

4.2.2  Surface Patch and Orientation Constraint
Generation

The first phase of the system consists of several
shape-from-texture components generating augmented texels. Each
augmented texel consists of a texel patch, orientation
constraints for the texel patch, and a confidence weighting
per constraint. The orientation constraints are stored in the
augmented  texel  as  vanishing  points  which  are  mathematically 


2 These approaches were further limited by the use of only a single
shape-from-texture method.

3 An analysis of this type of technique for texture discrimination can be
found in [Davis et al. 84].


equivalent to a class of other orientation notations (e.g., pan
and tilt constraints) [Shafer, Kanade, and Kender 83].
Moreover,  they  are  simple  to  generate  and  compact  to  store. 

The  confidence  weighting  is  defined  separately  for  each 
shape-from  method  and  is  based  upon  the  intrinsic  error  of  the 
method.  For  example,  shape-from-uniform-texel-spacing’s 
confidence  weighting  is  a  function  of  the  total  distance 
between  the  texel  patches  used  to  generate  that  constraint.  The 
confidence measure decreases as the distance between the
texels increases, because once the inter-texel distance grows too
large the local surface is no longer well approximated by a plane
and the orientation error grows. This also makes the
constraints group locally rather than globally, which is valid
since texels that are part of the same surface are normally
located close together.

The current system contains two shape-from-texture
methods: shape-from-uniform-texel-spacing [Moerdler and
Kender 85], based on the assumption that the texels can be of
arbitrary shape but are equally spaced, and
shape-from-uniform-texel-size [Ohta et al. 81], based on the unrelated
criterion that the spacing between texels can be arbitrary and
the size of all of the texels is equal but unknown. Each of
the methods is based on a different textural characteristic that
allows the generation of orientation constraints and also limits
the applicability of the approach. Future plans call for the
inclusion of shape-from-convergence [Kender 80] and
shape-from-ellipticity of circular textures.

In shape-from-uniform-texel-size, two texels T1 and T2 are
used, whose sizes are S1 and S2 respectively. If the distance
from the center of mass of texel T1 to texel T2 (see figure 1) is
defined as D, then the distance from the center of texel T2 to a
point on the vanishing line can be written as:
Figure 1: The calculation of shape-from-uniform-texel-size
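
As a hedged illustration of the kind of relation involved, assume the usual perspective result for a planar surface: the image area of a texel varies as the cube of its distance from the vanishing line. With F1 and F2 the distances of T1 and T2 from the vanishing line measured along the line joining them, this gives S1/S2 = (F1/F2)^3 with F1 = F2 + D when T1 is the larger texel. The cube law, the function name and the argument names are assumptions made for this sketch and are not taken from the system described here.

```python
def distance_to_vanishing_line(s1, s2, d):
    """Distance F2 from texel T2 to the vanishing line, along the line T1-T2.

    Assumption (not from the text): projected texel area scales with the
    cube of the distance to the vanishing line, so s1 / s2 == (f1 / f2) ** 3
    with f1 == f2 + d, where T1 is the larger of the two texels.
    """
    if s1 <= 0 or s2 <= 0:
        raise ValueError("texel sizes must be positive")
    r1 = s1 ** (1.0 / 3.0)
    r2 = s2 ** (1.0 / 3.0)
    if r1 <= r2:
        # Equal sizes would place the vanishing line at infinity.
        raise ValueError("expects s1 > s2")
    # Solve (f2 + d) / f2 == r1 / r2 for f2.
    return d * r2 / (r1 - r2)
```

Under this assumption two texels of equal size give no finite answer, which corresponds to a fronto-parallel patch.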

In shape-from-uniform-texel-spacing the calculations are
similar. Given any two texels T1 and T2 (see figure 2) whose
inter-texel distance is defined as D, if the distance from T1 to a
mid-texel T3 is equal to L and the distance from T2 to the
same mid-texel T3 is equal to R, the distance from texel T1 to
a vanishing point is given exactly by:
Figure 2: A geometrical representation of back-projecting.
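
One way to arrive at such a relation, assuming the three texels are collinear in the image and that T3 is the image of the point midway between T1 and T2 on the surface, uses the fact that equally spaced collinear points and the vanishing point of their line form a harmonic range. Placing T1 at 0, T3 at L and T2 at D = L + R on the image line, the vanishing point lies at X = D L / (L - R). The sketch below encodes this derivation; it is a reconstruction under the stated assumptions, and the function name is illustrative.

```python
def vanishing_point_distance(l, r):
    """Signed distance X from T1 to the vanishing point along the line T1-T2.

    l: image distance from T1 to the mid-texel T3
    r: image distance from T3 to T2
    Assumes T1, T3, T2 are images of three equally spaced collinear points,
    so that (T1, T2; T3, V) is a harmonic range.
    """
    if l <= 0 or r <= 0:
        raise ValueError("distances must be positive")
    if l == r:
        # No perspective foreshortening: the vanishing point is at infinity.
        raise ValueError("l == r gives a vanishing point at infinity")
    d = l + r
    return d * l / (l - r)
```

A negative result means the vanishing point lies on the far side of T1, away from T2.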


4.3  Intra-Process Integration for Textures:
Generation of Most Likely Orientation

Once  the  orientation  constraints  have  been  generated  for 
each  augmented  texel,  the  next  step  consists  of  unifying  the 
constraints  into  one  orientation  per  augmented  texel.  A  simple 
and computationally feasible method of integrating the
constraints generated by the shape-from-texture methods is to use a
Gaussian Sphere, which maps the orientation constraints to
points on the sphere [Shafer, Kanade, and Kender 83]. A
single  vanishing  point  circumscribes  a  great  circle  on  the 
Gaussian  Sphere;  two  different  constraints  generate  two  great 
circles  that  overlap  at  two  points  uniquely  defining  the 
orientation  of  both  the  visible  and  invisible  sides  of  the  surface 
patch. 

The Gaussian sphere is approximated within the module
by a hierarchical tessellated Gaussian Sphere based on
triangular faces called trixels [Fekete and Davis 84].
The top level of the hierarchy is the twenty-face icosahedron.
At each level, other than the lowest level of the hierarchy,
each trixel has four children which more closely approximate
the curvature of the spherical surface than their parent. This
hierarchical methodology allows the user to specify the
accuracy to which the orientation can be calculated by
defining the number of levels of tessellation that are created.

The  intra-texture-process  integration  phase  generates  the 
“most  likely”  orientation  for  each  texel  patch  by 
accumulating  the  evidence  from  all  the  orientation  constraints 
(generated  in  phase  one)  for  the  patch.  For  each  constraint,  it 
initially  visits  the  twenty  top  level  trixels,  determines  whether 
the  great  circle  falls  on  the  trixel  and  if  the  result  is  positive, 
visits  the  children.  At  each  lowest  level  trixel  through  which 
the  great  circle  travels,  the  likelihood  value  of  the  trixel  is 
incremented  by  the  constraint’s  weight.  The  hierarchical 
nature  of  this  approach  limits  the  number  of  trixels  that  need 
to  be  visited.  Once  all  of  the  constraints  for  a  texel  patch  have 
been  considered,  a  peak  finding  program  smears  the  likelihood 
values  at  the  lowest  level  trixels.  The  “most  likely” 
orientation  is  defined  to  be  the  trixel  with  the  largest  smeared 
value. 
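
The accumulation described above can be sketched as follows. This is a minimal illustration and not the module's implementation: each constraint is assumed to be given as the unit normal of its great circle plus a weight, trixels are split at re-normalized edge midpoints, and the smearing step of the peak finder is omitted, so the result is simply the centre of the leaf trixel with the largest accumulated weight. All function names are hypothetical.

```python
import itertools
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def icosahedron_faces():
    """Return the 20 top-level trixels as triples of unit vectors."""
    t = (1.0 + math.sqrt(5.0)) / 2.0
    verts = []
    for a in (-1.0, 1.0):
        for b in (-t, t):
            verts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    # Faces are the vertex triples whose pairwise distance is the edge length 2.
    faces = [tri for tri in itertools.combinations(verts, 3)
             if all(abs(math.dist(p, q) - 2.0) < 1e-9
                    for p, q in itertools.combinations(tri, 2))]
    return [tuple(normalize(v) for v in tri) for tri in faces]


def children(tri):
    """Split a trixel into four by joining its re-normalized edge midpoints."""
    a, b, c = tri
    mid = lambda p, q: normalize(tuple((x + y) / 2.0 for x, y in zip(p, q)))
    ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]


def vote(tri, normal, weight, depth, acc):
    """Add `weight` to every lowest-level trixel the great circle passes through."""
    dots = [sum(n * x for n, x in zip(normal, v)) for v in tri]
    if min(dots) > 0.0 or max(dots) < 0.0:
        return  # all vertices on one side: the circle misses this trixel
    if depth == 0:
        acc[tri] = acc.get(tri, 0.0) + weight
        return
    for child in children(tri):
        vote(child, normal, weight, depth - 1, acc)


def most_likely_orientation(constraints, levels=4):
    """constraints: iterable of (great_circle_normal, weight) pairs."""
    acc = {}
    for normal, weight in constraints:
        unit = normalize(normal)
        for face in icosahedron_faces():
            vote(face, unit, weight, levels, acc)
    best = max(acc, key=acc.get)
    # Report the (normalized) centroid of the winning leaf trixel.
    return normalize(tuple(sum(c) / 3.0 for c in zip(*best)))
```

Because only trixels crossed by a great circle are subdivided, the number of leaves visited per constraint grows roughly with the length of the circle rather than with the area of the sphere. Two circles always meet at two antipodal points, so either may be returned.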

This method does not assure that under all circumstances
a single “most likely” orientation will be derived. When more
than one “most likely” orientation is derived for a patch, the
module performs a Waltz-type filtering. It computes the “most
likely” orientation for the remaining augmented texels and
then re-analyzes the orientation constraints for each texel that
does not have a single “most likely” orientation. For each
unsolved texel patch the module considers all of the patch's
constraints and removes any constraints that do not correctly
define the “most likely” orientation of another texel patch.
Once this constraint pruning has occurred, the module
recomputes the “most likely” orientation for the patch. This
secondary analysis does not assure a single “most likely”
orientation either, but it does aid in simplifying and deriving a
single “most likely” orientation for the largest number of
surface patches.


5  Stereo  processes  and  intra-process  integration 

The  stereo-based  processes  of  the  system  are  based  on 
matching  features  between  the  two  images.  The  system  uses 
multiple feature definitions to ensure both good localization
and noise resistance. These features are then classified according
to their amount of ambiguity. The system starts with the least
ambiguous matches and reconstructs a disparity surface.
Intra-stereo-integration is accomplished through a regularized
reconstruction of the disparity field based on the assumption
that the smooth surfaces in the world give rise to a smooth
disparity surface. After all points are considered, the
intra-process module adds its output to the blackboard. Currently
this  output  is  depth  values  at  various  points,  especially  along 
the  “edges”  of  surfaces  in  the  disparity  field  and  at  the 
locations  of  feature  points. 


5.1  Definition  of  the  multiple  features 

A  common  problem  in  stereo  systems  is  that  the  features 
are  too  sparse,  have  poor  localization,  or  are  sensitive  to  noise. 
Rather  than  attempt  to  define  yet  another  feature  for  matching, 
the  stereo  module  currently  combines  two  different  types  of 
features. These are: (1) zero crossings of the Laplacian of
Gaussian of the images (which are subsequently thresholded,
based on the magnitude of the crossing, and matched along
approximately epipolar lines using orientation and sign as
filters), and (2) centroids of texels defined in the
shape-from-texture algorithm (with some of the other texel features used
to ensure only valid matches). The first of these features, zero
crossings, provides a large number of features for the matching
algorithm; unfortunately, the localization of these features is
not highly accurate. The second set of features, texel
centroids, is not very dense; however, it provides very
accurate localization of the feature.

In  the  future  we  will  be  adding  features  derived  from 
area  based  correlations,  interest  operators  (e.g.  [Moravec  79]) 
and  a  thresholded  sobel  operator. 


5.2  Intra-stereo-integration  module 

The integration of the various features is accomplished
by a multi-pass matching algorithm, where the quality of
localization/ambiguity affects the order in which points are
considered, and previously matched points affect the
disambiguation of other points. The matching algorithm used
is described in detail elsewhere [Boult and Chen 87]; only a
brief description is presented here. The basic assumption
underlying the matching algorithm is that the disparity surface
should be smooth. In vision research, there have been many
model-based matching algorithms proposed with different
“smoothly” varying disparities [Marr and Poggio 79;
Mayhew and Frisby 81; Eastman and Waxman 85]. The
smooth disparity fields used for this system are based on
generalized two-dimensional smoothing splines; see [Boult
86]. The smoothness criterion is similar to one used in smooth
surface reconstruction; see [Blake and Zisserman 86; Boult 86;
Grimson 79; Terzopoulos 84].

The system starts with the feature points which have
“unique matches” and good localization (i.e., at the current
time it begins with the centroids of the “texels”). In all
neighborhoods without these features, lower quality (in terms
of localization) features with “unique” matches are added;
however, they are given a lower confidence value. Thus when
the smooth surface is fitted to the disparity data, the disparity
values generated by lower quality features will not be as
closely approximated.

After  all  the  “unique”  matches  have  been  used,  the 
module  reconstructs  a  disparity  surface.  This  reconstruction  is 
based  on  the  assumption  that  the  disparity  surface  should  give 
rise  to  a  surface  with  smooth  depth  changes.  Using  this 
disparity  surface,  the  module  disambiguates  other  matches  by 
choosing  the  potential  match  which  comes  closest  to  the 
smooth  surface.  The  distance  between  the  disparity  predicted 
by  the  “best”  match  and  the  smoothed  disparity  surface 
affects the confidence of the match, which in turn affects the
way  the  disparity  surface  approximates  that  match.  The 
disambiguation  takes  place  in  multiple  passes  each  of  which 
incorporates  features  that  are  increasingly  ambiguous. 
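
The pass structure described above can be illustrated with a deliberately simplified one-dimensional sketch. It is not the matching algorithm of [Boult and Chen 87]: piecewise-linear interpolation stands in for the smoothing-spline disparity surface, ambiguity is measured by the number of candidate disparities, and the confidence formula is an arbitrary decreasing function of the distance to the predicted disparity. All names are hypothetical.

```python
def interpolate(anchors, x):
    """Piecewise-linear disparity estimate at x from {position: disparity}."""
    pts = sorted(anchors.items())
    if x <= pts[0][0]:
        return pts[0][1]
    if x >= pts[-1][0]:
        return pts[-1][1]
    for (x0, d0), (x1, d1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return d0 + (d1 - d0) * (x - x0) / (x1 - x0)


def disambiguate(unique, ambiguous):
    """unique: {position: disparity}; ambiguous: {position: [candidate disparities]}.

    Returns the chosen disparity and a confidence for every position.
    """
    anchors = dict(unique)
    confidence = {x: 1.0 for x in anchors}
    # One pass per ambiguity level, least ambiguous features first.
    for level in sorted({len(c) for c in ambiguous.values()}):
        surface = dict(anchors)  # surface built from all earlier passes
        for x, candidates in ambiguous.items():
            if len(candidates) != level:
                continue
            predicted = interpolate(surface, x)
            best = min(candidates, key=lambda d: abs(d - predicted))
            anchors[x] = best
            # Matches far from the current surface get a lower confidence.
            confidence[x] = 1.0 / (1.0 + abs(best - predicted))
    return anchors, confidence
```

Points resolved in an earlier pass become part of the surface used by later passes, which is how previously matched points influence the disambiguation of the remaining ones.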

After  all  points  are  considered,  the  intra-process  module 
adds  its  output  to  the  blackboard.  Currently,  this  output  is 
depth  values  at  various  points,  especially  along  the  “edges” 
of  surfaces  in  the  disparity  field,  and  depends  on  the 
calibration  of  the  imaging  system.  Future  plans  call  for  the 
module  to  output  surface  orientation  information  in  the  place 
of  depth  data,  thereby  eliminating  the  need  for  calibration. 


6  Inter-process  integration  and  surface 
reconstruction 

This  section  describes  the  inter-process  integration  phase 
of  the  system.  This  phase  of  the  fusion  process  is  predicated 
on  the  assumption  that  the  world  is  comprised  of  piecewise 
smooth  surfaces/objects.  Therefore,  the  inter-process 
integration  should  depend  on  the  assumed  smoothness 
model(s)  for  surfaces,  not  on  the  data  acquisition  techniques. 
There  are  two  main  aspects  of  the  inter-process  integration, 
basic  surface  building,  and  the  weighting  of  various  modules. 


6.1  Basic  Surface  Reconstruction  Technique 

Inasmuch  as  they  can  be  expressed  in  terms  of  inverse 
optics,  many  problems  in  computer  vision,  including  surface 
reconstruction  from  sparse  3D  information,  are  ill-posed.  One 
way  of  reformulating  these  ill-posed  problems  is  through  a 
well known technique called regularization.4 Let us precisely
define  the  problem  at  hand: 


4 There exist numerous ways to calculate such surfaces; see Chapter 9 of
[Boult 86] for a critical comparison of four methods.


Let $F_1$, the space of allowed surfaces, be a Hilbert or
semi-Hilbert space. Let $\|\cdot\|_{F_1}$ be a norm (or semi-norm if $F_1$
is semi-Hilbert) measuring the “unreasonableness” of a
surface $f$. Let $N_n(f) = [L_1(f), \ldots, L_n(f)]$ be the given
information. Then the visual surface reconstruction problem
is to find $f^* \in F_1$ such that

$$\Theta(f^*) = \min_{g \in F_1} \Theta(g),$$

where $\Theta(\cdot)$ is defined as

$$\Theta(g) = \lambda \cdot \|g\|_{F_1} + \sum_{i=1}^{n} \delta_i \times \bigl(L_i(g) - L_i(f)\bigr)^2 .$$

The norm (semi-norm) $\|\cdot\|_{F_1}$ is generally referred to as
the stabilizing functional of the regularization. The class of
functions $F_1$ is an often overlooked, but immensely important,
part of the regularization. One cannot indiscriminately choose
how to regularize a problem. As pointed out in
[Poggio et al. 85, page 315],5

“... standard regularization methods have to be
applied after a careful analysis of the ill-posed
nature of the problem. The choice of the norm $\|\cdot\|$, of the
stabilizing functional $\|\cdot\|_{F_1}$ and of the functional
spaces involved is dictated by both mathematical
properties and by physical plausibility. They
determine whether the precise conditions for a
correct regularization hold for any specific case.”

The  use  of  regularization  for  reconstruction  of  smooth 
surfaces  in  vision  was  first  proposed  in  [Grimson  79].  In  that 
pioneering  work  Grimson  discussed  the  choice  of  the  most 
appropriate  stabilizing  functional  (though  in  different 
terminology).  However,  his  decision  was  partially  based  on 
an  assumed  relationship  between  the  stabilizing  functional  and 
the zero-crossings of the intensity images which led to the
stereo depth data. With respect to visual surface
reconstruction, many researchers (e.g., [Terzopoulos 84; Hoff
and Ahuja 85; Poggio et al. 85; Lee 85]) seem to have
accepted the choice of norm, stabilizing functional and
associated functional spaces that were initially proposed in
[Grimson 79].6

However, the authors feel that this class is “too
smooth” and thus have adopted the assumption that world
surfaces can be piecewise modeled as surfaces belonging to
the class $D^{-2}H^{-.5}$ with the second Sobolev semi-norm,
which can intuitively be described7 as having only 1.5
derivatives in $L^2$. In work reported elsewhere [Boult 87], this
assumption was shown to be at least as reasonable as
assuming surfaces in $D^{-2}L^2$. The reconstruction of the
surfaces might be accomplished with discrete regularization
techniques. However, the approach taken herein is based on
generalized smoothing spline functions (see [Boult 86] and the
references therein) and is more efficient for sparse data; see
[Boult 85].


5The  mathematical  notation  in  the  quote  has  been  modified  to  match  that 
used  in  this  paper. 

6It  is  not  clear  if  this  acceptance  is  simply  because  the  method  gave 
reasonable  results  and  there  was  a  lack  of  alternatives,  or  because  the 
choice  is  actually  the  most  appropriate. 


7 For a precise definition see [Boult 86; Boult 87; Meinguet 79].


The system allows each data point to be individually
weighted in terms of its contribution to the allowed fitting
error (the terms $\delta_i$ in the above definition). The choice of these
weights is influenced by two things: the confidence passed for
the point from the intra-process integration processes, and the
weighting assigned to the module as a whole.

6.2  Weighting  the  outputs  of  the  various  intra-process 
integration  modules. 

The  above  fusion  scheme  requires  that  each  data  point 
be  given  a  weight.  The  correct  selection  of  these  weights  is 
difficult. The study of these weights will be of paramount
importance  in  our  future  research.  Currently,  the  system 
builds  three  surfaces: 

1.  One  surface  from  the  output  of  the  intra-texture- 
integration  module  using  the  weighting  supplied 
by  that  module, 

2.  One surface from the output of the
intra-stereo-integration module using the weighting supplied
by that module, and

3.  One  surface  combining  all  data.  For  the 
combination,  the  weights  are  divided  by  the 
number  of  data-points  output  by  a  module.  This 
provides  some  means  for  the  texture  data  to  have 
an  effect  on  the  surface.  Otherwise  the  stereo 
data  (with  500-5000  points)  would  totally 
dominate  the  shape  information  from  texture 
(which only provides ~10-50 data points).
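
The normalization in item 3 can be sketched as follows. The data layout (a dictionary from module name to a list of point, value, weight triples) and the function name are assumptions made for illustration.

```python
def combine_module_outputs(modules):
    """Merge module outputs, dividing each weight by that module's point count.

    modules: {name: [(point, value, weight), ...]}
    Returns a flat list of (point, value, scaled_weight, name).
    """
    combined = []
    for name, points in modules.items():
        count = len(points)
        if count == 0:
            continue  # a module with no output contributes nothing
        for point, value, weight in points:
            combined.append((point, value, weight / count, name))
    return combined
```

With this scaling, a module that reports 1000 points at unit weight and a module that reports 20 points at unit weight carry the same total weight in the combined surface.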

While it would be nice for the system to choose which of
these surfaces is the “correct” one, this is not possible. When
the information is conflicting, the “correct” percept is
subjective, and can often be changed by will in humans.
However,  when  the  surfaces  agree,  the  system  should  be  able 
to  (but  currently  cannot)  take  note,  and  remove  the  redundant 
representations. 


7  Experimentation 

The system described in this paper is still under
development and has only been subjected to limited
experimental testing. Presented in this section are two
examples of the system working on camera images. The
results are presented in figures 3 to 12. One example, the
curved roll of paper (see figure 3), demonstrates a surface
where stereo dominates, but is significantly aided by the
texture information. In the other example (see figure 8),
texture was more successful because the scan line coherence
assumed in stereo was violated by a slight rotation of the
scene. The stereo module was thus slightly off on each
“texel” and produced a bumpier surface (not shown).


8  Conclusions  and  future  plans 

This  paper  describes  an  ongoing  research  project  that 
integrates  two  modalities:  shape-from-texture  and  information 
from  stereo.  The  system  makes  use  of  a  two  level  integration 
scheme  and  while  the  experimental  analysis  is  not  complete, 
initial  results  show  this  multi-level  integration  to  be  both 
efficient  and  effective. 


Figure  3:  Left  and  right  input  images  for  example  1 : 
An  artificially  textured  roll  of  paper 
Figure 4: Laplacian of Gaussian of left image for Stereo

Figure 5: Texels identified in left image


Future  work  will  include  the  addition  of  modules  for 
shape-from-multi-view-texture, shape-from-shading, and
possibly  shape-from-motion.  In  addition,  the  number  of 
processes  in  both  the  existing  stereo  and  texture  modules  will 
be  expanded. 

An  important  task  awaiting  these  researchers  is  the  study 
of  the  weighting  of  the  outputs  of  the  various  modules,  and 
some attempt at the development of non-ad hoc criteria.

Figure 6: Reconstruction from just stereo data


Figure  7:  Reconstruction  from  combined  stereo  and  texture  data 


Figure 8: Left and right input images for example
"Circuit  Breadboard" 


Figure 9: Laplacian of Gaussian of left image for Stereo


Figure  10:  Texels  identified  in  left  image 
Abstract

This paper describes additional results on the
work [Lim 1987a] which appeared in the last Image
Understanding Workshop. We design and implement a
high-level stereo-vision system which achieves global
correspondence between high-level structures of stereo
pairs. This structural correspondence reduces matching
complexity, and provides globally consistent matching
results, avoiding local mismatches. Recovered range
data is segmented into surfaces and bodies.

1  Introduction 

We perform stereo correspondence using five different
types of structures: bodies, surfaces, curves, junctions,
and edgels (Figure 1). These structures are segmented
and grouped into a hierarchical structure based on
inclusion (Figure 2 and Figure 3). The matching of complex
structures at the top level is used to guide and
constrain the matching of simpler structures at lower
levels of the hierarchical structures (Figure 4). This
is the first system to perform matching at the level of
surfaces and bodies.

The system applies monocular interpretation in order
to segment individual images before matching.
Given a pair of stereo images (Figure 5), edgels are
detected using the Nalwa edge operator [Nalwa 1986].
Detected edges are then fitted with straight lines or
conics [Nalwa 1987] (Figure 6). We extend curves
around junctions to locate missing edges and form new
junctions (Figure 7). These junctions are classified
accordingly. Surfaces are segmented from curves and are
grouped into separate bodies (Figure 8 and Figure 9).
Because of this segmentation effort, we are matching
symbolic structures. The resulting three dimensional

*This work was supported by the Defense Advanced Research
Projects Agency under contract number N00039-84-C-0211.

Present Address: IBM Corporation, Los Angeles Scientific
Center, 11601 Wilshire Boulevard, Los Angeles, CA 90025


Figure 1: Higher-level Structures have Fewer Occurrences

depth data is segmented and has symbolic meanings
attached to the parts (Figure 10). Thus it is not necessary
to perform further segmentation in three-dimensional
space as other systems must.

Because of the hierarchical structure, the number of
bodies at the highest level is two orders of magnitude
less than the number of edgels at the lowest level. We
show that the complexity of the matching is thus
reduced by at least four orders of magnitude in time.
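
As a rough plausibility check only, suppose the cost of matching is proportional to the number of candidate pairs, i.e. to the product of the structure counts in the two images. This quadratic cost model is an assumption of the sketch and is not the analysis given in Section 3. With $N_e$ edgels and $N_b \approx N_e / 100$ bodies per image:

```latex
\[
\frac{C_{\mathrm{edgel}}}{C_{\mathrm{body}}}
  \approx \frac{N_e^{2}}{N_b^{2}}
  = \left( \frac{N_e}{N_e / 100} \right)^{2}
  = 10^{4}
\]
```

Under that assumption a reduction of two orders of magnitude in the number of structures yields four orders of magnitude in matching time.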

Section 2 explains in detail the structures chosen for
stereo correspondence. The structures include edgels,
curves, junctions, surfaces, and bodies. They are
segmented into a hierarchical structure by edge detection,
curve fitting, curve extension, junction classification,
and monocular interpretation. Some viewpoint-sensitive
and invariant properties of these structures are
discussed.

Section 3 focuses on the matching strategy. The
epipolar constraint is extended to surfaces and bodies.
This is a major constraint used in this work. It is
used for determining the matching of bodies and
surfaces. A hierarchical constraint is introduced. We show
that the matching complexity is reduced by four orders
of magnitude through the hierarchical arrangement of

Figure 2: Grouping of Structures by Inclusion




Figure  3:  Hierarchical  Arrangement  of  Structures 


structures. 

Results of implementation are shown in Section 4.
These include the detection and segmentation of various
structures, and how they are grouped into a hierarchical
structure. Projections of matching results are
given.

Conclusion  and  discussion  are  presented  in  Section  5. 


2  Structures for Correspondence

For stereo correspondence, we desire to have structures
that are invariant under general viewpoints.
Viewpoint-invariant structures exhibit the same characteristics
from different viewing positions and thus
correspondences may easily be located. Low-level
structures are usually not viewpoint-invariant. Edge
angle and interval length are quasi-invariant. Contrast
across an edge and gray-scale intensity are variant
properties that depend on viewing positions. These
variant structures are used for correspondence in previous
stereo systems because there is a high correlation
between the appearance of these structures in stereo
images, especially when the angle between different
viewpoints is small. Arbitrary thresholdings or heuristic
weighting mechanisms are required to incorporate
these viewpoint-dependent structures. We avoid using
these viewpoint-dependent structures as primary structures
for matching, but they certainly render useful
supportive evidence when no viewpoint-independent
information is available.

Figure 4: Hierarchical Matching of Structures

In this section, we discuss the structures that we have
chosen for correspondence: edgels, curves, junctions,
surfaces, and bodies. Various viewpoint-independent
characteristics of these structures are described. They
provide useful constraints during the matching process. We
also describe methods for segmenting the chosen structures
from images. The segmentation process is carried out on
each image independently, so the two images can be
segmented simultaneously; the resulting structures are then
grouped and arranged hierarchically.


2.1  Edgels 


Edgels  are  the  low-level  structures  chosen.  They  are 
derived  from  the  gray-scale  intensity  image  and  provide 
a  good  source  of  structures  for  matching.  Edgels  may 
correspond  to: 


•  depth discontinuities (e.g. occlusion)

•  continuous-surface-normal depth discontinuities
(e.g. limb)

•  surface-normal discontinuities (e.g. crack)

•  surface-reflectance discontinuities (e.g. surface
marking)

•  illumination discontinuities (e.g. shadow)


The second type of discontinuity depends on viewpoint.
The other discontinuities correspond to physical changes
on surfaces in the three-dimensional world. These changes
are reflected in both images and are therefore ideal for
matching.

Continuous-surface-normal  depth  discontinuities 
correspond  to  limbs,  which  are  viewpoint-dependent. 
Thus  a  pair  of  corresponding  limbs  appearing  in  two 
different  images  do  not  correspond  to  the  same  phys¬ 
ical  points  in  space.  Therefore,  they  should  not  be 
matched  directly.  A  further  discussion  on  how  to  utilize 
limbs  for  reconstruction  of  curved  surfaces  is  presented 
in [Lim 1988a].

Although the detection of an edgel is invariant,
the angle of an edgel appearing in an image is not.
Arnold and Binford [Arnold 1980] derive additional
analytic results that give the likelihood that an
arbitrary pair of edgels will have the same angle. They
show that edge angle is quasi-invariant for stereo.


2.2  Curves 


Curves are the second type of structure chosen for
correspondence. Each consists of a set of ordered and
connected edgels. Straight lines and planar polynomial
curves preserve their order under perspective projection,
i.e. straight lines project to straight lines and
planar polynomial curves project to polynomial curves
in an image.

An  n  degree  planar  curve  in  space  projects  to  an  n 
degree  curve  in  the  image.  In  particular,  conics  map 
into  conics  and  straight  lines  map  into  straight  lines. 
Since  the  degree  of  a  curve  is  preserved  under  projec¬ 
tion,  we  can  match  the  degree  of  fitted  curves  in  a  pair 
of  stereo  images.  However,  in  practice,  we  need  to  have 
a  very  robust  curve  fitting  algorithm  and  enough  infor¬ 
mation  in  the  data.  A  proof  showing  that  the  degree 
of  a  planar  curve  is  invariant  is  given  in  [Lim  1987b]. 
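
The degree-preservation property can be motivated by a standard projective-geometry argument. The sketch below assumes that the plane of the curve does not pass through the center of projection, so that the map from plane coordinates to image coordinates is an invertible homography. It may differ from the proof in [Lim 1987b], and the symbols F, G, and H are introduced here only for illustration.

```latex
% Planar curve of degree n, written in homogeneous plane coordinates (X, Y, W):
F(X, Y, W) \;=\; \sum_{i+j+k=n} a_{ijk}\, X^{i} Y^{j} W^{k} \;=\; 0 .

% Perspective projection of the plane onto the image is a homography:
\mathbf{x} \;\simeq\; H\,\mathbf{X}, \qquad \det H \neq 0 .

% The image curve is obtained by substituting X = H^{-1} x:
G(\mathbf{x}) \;=\; F\!\left(H^{-1}\mathbf{x}\right) \;=\; 0 .

% Each of X, Y, W is a linear form in the image coordinates, so G is
% homogeneous of degree at most n. Since F(\mathbf{X}) = G(H\mathbf{X}),
% the degree of G cannot be smaller than n, hence \deg G = n.
```

With n = 1 this gives lines mapping to lines, and with n = 2 conics mapping to conics, as stated above.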

Figure 7: Curve Extension (Set 1)

Binford [Binford 1981] points out that a continuous
curve in space projects to a continuous curve in the image.
We can thus constrain a continuous curve in one image
to match only a continuous curve in the other image.
Likewise, a break in a space curve (tangent discontinuity)
will result in a break in its image, unless there is
a coincidence in which the camera is coplanar with the
two tangents of the curve at the break. Thus, breaks
in image curves are good structures for matching.
Because these breaks can be sharply localized, they also
provide restrictive constraints for matching.

2.3  Junctions 

When two or more curves terminate at a junction, we
infer that coincidence of these curves in an image
implies coincidence in space. A junction in space
corresponds to an intersection of two or more curves.
Junctions are named according to the number of edges
forming the junction and the angles subtended by these
edges (Figure 11). A junction with only two curves is
called an L junction. A junction with three curves is
called an A junction when one of the angles between
any two curves is greater than 180 degrees, a Y junction
when none of the angles is greater than 180 degrees,
and a T junction when one of the angles is equal
to 180 degrees. A junction with four or more edges is
called an X junction when none of the angles is 180
degrees and a K junction when one of the angles is 180
degrees.

Figure 8: Segmented Surfaces (Set 1)

The termination of a continuous image curve (a T
junction) can result from three different occurrences:

•  occlusion  of  an  edge  (Figure  12) 

•  termination  of  a  surface  marking  at  the  visible 
edge 

•  surface  marking 

In the first case, the formation of T junctions is
viewpoint-dependent and the projections will not, in
general, appear on a pair of corresponding epipolar
lines. We refer to the two horizontal curves in the
example forming the T junction as the top and the
vertical curve as the stem (Figure 11). The T junctions
formed in the last two cases can usually be
distinguished from the first case. In the last two cases,
since the T junctions correspond to a physical point in
three-dimensional space, the projections are
viewpoint-independent and they project to a pair of
corresponding epipolar lines.

Figure 9: Segmented Bodies (Set 1)

A T junction gives strong local evidence for an
occlusion [Binford 1981, Malik 1984]. Surfaces are not
connected across a T junction. We infer that the surface
bounded by the top curves occludes the other surfaces
(Figure 12). For junctions of higher order, we
hypothesize that any two curves forming an angle of 180
degrees belong to one surface.

Figure 10: Projection of Reconstruction (Set 1)

Figure 11: Classification of Junctions

2.4  Surfaces 

Surfaces  may  be  curved  or  planar.  The  image  of  each 
surface  is  bounded  by  a  set  of  curves  and  junctions. 
Images  of  surfaces  may  have  closed  or  open  bound¬ 
aries.  Images  of  surfaces  with  open  boundaries  result 
when  parts  of  the  surfaces  are  occluded  and  T  junctions 
are  formed  (Figure  12).  Open  boundaries  may  also  be 
formed  when  complete  information  is  not  available,  e.g. 
when  some  edges  are  only  partially  detected  due  to  low 
contrast  in  image  intensities. 

2.5  Bodies 

A body is a closed, connected component of surfaces in
three-dimensional space. In monocular interpretation,
adjacent surfaces are connected unless their shared
bounding curve corresponds to a depth discontinuity.
Occlusion is indicated by T junctions along the
boundary (Figure 12).

Figure 12: Occlusion of Surface and Body

Figure 13: Order of Opaque Surfaces is Preserved (labels: A-B-D, A-C-D)

The  order  of  surfaces  along  the  baseline  of  the  cam¬ 
eras  is  an  invariant  property.  That  is,  if  surface  A  is  to 
the  left  of  surface  B,  then  this  order  is  preserved  in  the 
images.  This  is  true  for  opaque  surfaces  (Figure  13). 


2.6  Hierarchical Arrangement of Structures

The structures detected from each image are grouped
and arranged into a hierarchical structure (Figure 3).
This structure is divided into five levels, from top to
bottom: bodies, surfaces, junctions, curves, and edgels.
The number of structures in each level is reduced as
we go up the hierarchical structure (Figure 1) and
the structures become more global. To assure global
matching, we start matching at the highest level. The
matching process is discussed in the next section.


3  Matching

In this section, we first discuss constraints which must
be satisfied by matches. The matching process is then
described in detail. The computational complexity of
our matching scheme is compared to that of others.
The result shows that the time required for matching
is reduced by at least four orders of magnitude for our
method.

Constraints used to limit computation and reduce
ambiguity include:

•  Continuity Constraint

•  Disparity Constraint

•  Epipolar Constraint

•  Hierarchical Constraint

3.1  Continuity Constraint

Continuity of disparity along extended curves is
implicit in our choice of structures. We fit extended curves
to detected edgels. A pair of edgels in two images
match only if the curves to which they belong are matched.
Thus two pairs of matched edgels on consecutive
epipolar lines must have a continuously varying disparity.
This differs from the assumption of continuous disparity
within epipolar lines.

3.2  Disparity Constraint

We utilize the most general disparity constraint and
assign only the weakest bounds on the disparity range.
With the current setup, where the axes of the two
cameras are parallel to each other and perpendicular to
the baseline, the disparity must be positive. On average,
this constraint alone halves the search space, from
the whole epipolar line to half of the epipolar line. This
constraint has been used previously.

3.3  Epipolar Constraint

The  epipolar  constraint  requires  that  points  on  a  struc¬ 
ture  in  one  image  must  correspond  to  points  on  struc¬ 
tures  only  on  the  corresponding  epipolar  line  in  the 
other  image.  This  effectively  reduces  the  search  from 
a  two-dimensional  space  to  a  one-dimensional  space. 
With  our  camera  setting,  only  horizontal  disparity  ex¬ 
ists  between  a  pair  of  corresponding  structures. 
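
The combined effect of the epipolar and disparity constraints on the candidate set can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes rectified images in which each epipolar line is an image row, represents an edgel as a hypothetical `(x, row)` tuple, and takes disparity as `x_left - x_right`, a sign convention chosen here for the example.

```python
from collections import defaultdict


def candidate_matches(left_edgels, right_edgels):
    """Return {left_index: [right_index, ...]} under both constraints.

    Each edgel is an (x, row) tuple; `row` indexes the epipolar line.
    """
    # Group right-image edgels by epipolar line (epipolar constraint).
    by_row = defaultdict(list)
    for j, (x_right, row) in enumerate(right_edgels):
        by_row[row].append((x_right, j))

    candidates = {}
    for i, (x_left, row) in enumerate(left_edgels):
        # Keep only same-row edgels with positive disparity
        # (assumed convention: disparity = x_left - x_right).
        candidates[i] = [j for x_right, j in by_row[row] if x_left - x_right > 0]
    return candidates


left = [(50, 3), (20, 7)]
right = [(40, 3), (60, 3), (45, 4), (10, 7)]
print(candidate_matches(left, right))  # {0: [0], 1: [3]}
```

The first left edgel keeps one of the two right edgels on its row, since the other would give a negative disparity, and the edgel on row 4 is never considered.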

We extend the epipolar constraint to curves, surfaces,
and bodies. The epipolar constraint holds for every
point on a curve; hence for two curves in two views
to correspond, for every point on one, a corresponding
point on the other must be found on the same epipolar
plane, or the point must be obscured. The same is true
of images of surfaces, which are bounded by curves, and
of images of bodies. This constraint involves
two-dimensional information which is non-local
and provides a more global construct than that which
can be derived from neighboring pixels alone.

Figure 14: Inequality Constraint on Bodies (B2 matches B4; A > B)

Occlusion  is  identified  by  monocular  interpretation 
and  provides  an  inequality  constraint  on  corresponding 
curves  in  the  other  view.  They  must  correspond  on 
each epipolar line or be obscured. In Figure 14, the
curve Cl of body B2 is obscured at epipolar line A,
whereas the curve Cr of body B4 is not. Thus we can
set up an inequality constraint, requiring that

$$A > B \tag{1}$$

if B2 matches B4. In Figure 15, curve Cl and curve
’>  are  obscured  at  epipolar  line  B  and  A  respectively, 
whereas curve Cl and the top portion of curve C4 are
not. Hence we can set up two inequality constraints,
requiring that

$$C > A, \quad \text{and} \tag{2}$$

$$B < C \tag{3}$$

for the matching of S1 and S2, and S3 and S4, respectively.

3.4  Hierarchical  Constraint 

All the constraints introduced in the previous
discussion are matching constraints. Any match must satisfy
these constraints. They reduce the number of possible
matches for a given structure. As a result, the total
size of the search space is reduced, but it is important
to note that all discussion has been independent of a
particular search strategy to be employed in finding
corresponding matches. The constraints hold for any
search strategy.

Figure 15: Inequality Constraint on Surfaces

When  more  than  one  match  at  a  lower  level  is  possi¬ 
ble  for  a  particular  structure,  the  correspondence  in¬ 
formation  at  a  higher  level  may  be  used  to  resolve 
this  ambiguity.  For  example,  if  there  is  more  than  one 
match  for  a  surface,  only  surfaces  belonging  to  a  pair  of 
matched  bodies  are  considered.  This  hierarchical  dis¬ 
ambiguating  technique  is  an  efficient  search  algorithm. 

3.4.1  Matching  of  Bodies 

A hierarchy of image structures is segmented from each
image independently before the matching stage, as
described in the previous section. Matching (Figure 4)
starts from the highest level of the structure: the body
level. A pair of bodies can potentially be matched
only if they have the same maximum and minimum
extents. We define the maximum and minimum extents of
an image structure as the maximum and minimum
v-coordinates spanned by the image structure. For bodies
that are partially occluded, inequality constraints are
used as described above. The following simple analysis
shows that the probability of two different bodies
having the same extents is very small.

Assume  an  image  with  L  epipolar  lines  and  B  bod¬ 
ies.  Assume  that  none  of  the  bodies  are  occluded.  If 
the  bodies  are  randomly  distributed  in  an  image  then 
the  probability  of  two  different  bodies  having  the  same 
maximum  extents  on  the  same  epipolar  line  is 


If  all  bodies  have  the  same  size  in  an  image,  then  the 
probability  of  two  different  bodies  having  the  same  min¬ 
imum  extents  on  the  same  epipolar  line  is  the  same. 

Figure 16: Multiple Potential Matches for a Body (panels: Left Image, Right Image)


However, different objects have different sizes and
their dimensions in an image are different. Even for
similar objects, their dimensions in an image differ
when their orientations and distances with respect
to the viewpoints differ. Assume the number of
different dimensions possible for all bodies in an
image is S, and that these dimensions are randomly
distributed over all bodies; then the probability of two
different bodies having the same dimension is given by


Thus,  the  probability  of  two  different  bodies  having  the 
same  maximum  and  minimum  extents  is  given  by  the 
product  of  Equation  4  and  Equation  5,  which  is  equal 
to 

vrs-  <6> 
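
Under the stated assumptions (bodies placed uniformly at random over L epipolar lines, dimensions drawn uniformly from a fixed set of possible values, and position independent of dimension), the three probabilities can be sketched as follows. This is one reading consistent with the surrounding text, not a transcription of Equations 4 to 6. The names P_max, P_dim, and P_ext are introduced here for illustration, and S denotes the number of possible body dimensions.

```latex
% Same maximum extent: the second body must fall on the same one of L lines.
P_{\max} \;=\; \frac{1}{L}

% Same dimension: the second body must take the same one of S dimensions.
P_{\mathrm{dim}} \;=\; \frac{1}{S}

% Same maximum and minimum extents: same maximum extent and same dimension,
% treated as independent events.
P_{\mathrm{ext}} \;=\; P_{\max}\, P_{\mathrm{dim}} \;=\; \frac{1}{L\,S}
```

For example, with a hypothetical L = 500 lines and S = 50 possible dimensions, this reading gives a probability of 1/25000 that two unrelated bodies share both extents.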

A pair of potentially matched bodies is considered
matched if at least one pair of their surfaces match. If
more than one candidate body matches a body in the
other image, then the one with more matched
surfaces is chosen. We use a simple one-level
backtracking mechanism which allows backtracking if a pair of
bodies is incorrectly matched. If none of the surfaces
can be matched, then the potentially matched pair is
considered unmatched.
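
The body-matching rule can be illustrated with a small sketch. It is not the authors' implementation: the `Structure` class, the function names, and the use of exact extent equality as the surface-matching test are all assumptions made for the example, and occlusion (the inequality constraints) is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    """A body or surface with its (min_v, max_v) extent and child structures."""
    name: str
    extent: tuple
    children: list = field(default_factory=list)


def count_child_matches(left, right):
    """Count children of `left` paired one-to-one with same-extent children of `right`."""
    unused = list(right.children)
    matched = 0
    for child in left.children:
        for other in unused:
            if child.extent == other.extent:
                unused.remove(other)
                matched += 1
                break
    return matched


def match_body(right_body, left_bodies):
    """Pick the left body with equal extents and the most matched surfaces, or None."""
    best, best_count = None, 0
    for body in left_bodies:
        if body.extent != right_body.extent:
            continue  # extents differ: not a potential match
        count = count_child_matches(body, right_body)
        if count > best_count:  # a candidate with zero matched surfaces is dropped
            best, best_count = body, count
    return best


rb = Structure("RB", (10, 50), [Structure("s1", (10, 30)), Structure("s2", (30, 50))])
lb1 = Structure("LB1", (10, 50), [Structure("a", (12, 28))])
lb2 = Structure("LB2", (10, 50), [Structure("b", (10, 30))])
lb3 = Structure("LB3", (10, 50), [Structure("c", (10, 30)), Structure("d", (30, 50))])
print(match_body(rb, [lb1, lb2, lb3]).name)  # LB3
```

All three left bodies pass the extent test, the first is dropped because none of its surfaces match, and the third wins because it has two matched surfaces against one.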

In Figure 16, LB1, LB2, and LB3 in the left image
are considered potential matches for RB in the right
image because they have the same maximum and
minimum extents. Since no surface of LB1 matches
any surface of RB, backtracking is invoked and
LB1 is no longer considered a potential match. LB2
and LB3 are considered matched since both of them
have at least one surface which matches one surface
of RB (in the sense that they both have the same
extents). However, LB3 is finally chosen because it has
more matched surfaces with RB.

Figure 17: Matching of Surfaces (B2 matches B4; S2 matches only S4)


3.4.2  Matching  of  Surfaces 

Matching a pair of surfaces from two images is
considered only if they belong to a pair of potentially
matched bodies. In Figure 17, assume B1 matches B3.
Without the hierarchical constraint, S1 can match
either S3 or S4 since they span the same top and
bottom epipolar lines. However, with the hierarchical
constraint, only the matching of S1 and S3 is considered
since they belong to a pair of matched bodies. The
matching of S1 and S4 will not be considered at all.
Thus the number of combinatorial matches between
surfaces is reduced.

A  pair  of  surfaces  is  considered  potentially  matched  if 
they  have  the  same  extents.  An  analysis  similar  to  that 
for  bodies  indicates  that  the  probability  of  two  differ¬ 
ent  surfaces  having  the  same  maximum  and  minimum 
extents  is  also  very  small. 

A pair of potentially matched surfaces is considered
matched if at least one pair of their curves match (i.e.
two pairs of connected junctions match). If more than
one candidate surface matches a surface in the
other image, then the one with more matched curves
is chosen. Again we use a simple one-level
backtracking mechanism which allows backtracking if a pair of
surfaces is incorrectly matched. If none of the curves
can be matched, then the potentially matched pair is
considered unmatched.

In Figure 18, LS1, LS2, and LS3 in the left image
are considered potential matches for RS in the right
image because they have the same maximum and
minimum extents. Since no curve of LS1 matches any
curve of RS, backtracking is invoked and LS1 is no
longer considered a potential match. LS2 and LS3 are
considered matched since both of them have at least
one curve which matches one curve of RS (in the sense
that they both have the same extents). However, LS3
is finally chosen because it has more matched curves
with RS.

Figure 18: Multiple Potential Matches for a Surface

Figure 19: Matching of Curves and Junctions

3.4.3  Matching  of  Junctions  and  Curves 

There is a parallel between matching surfaces
belonging to a pair of matched bodies and matching junctions
and curves belonging to a pair of matched surfaces. In
Figure 19, a pair of junctions or curves is matched
only if they belong to a pair of potentially matched
surfaces. Thus, using the hierarchical constraint and
assuming S1 matches S3, only C5 will be considered a
possible match for C1, and not C7, even though both of
them have the same dimensions as C1. This again saves
computational time by reducing possible matches. A
pair of junctions is matched if they are on the same
epipolar line and a pair of curves is matched if they
have compatible extents.

Once  a  pair  of  surfaces  is  matched,  the  junctions 
and  curves  belonging  to  the  matched  surfaces  may  be 
matched  accordingly  since  they  are  arranged  in  ordered 
lists.  After  matching  of  the  curves,  again  the  edgels  be¬ 
longing  to  the  matched  curves  may  be  matched  easily 
since  they  are  also  arranged  in  ordered  lists. 


3.4.4  Hierarchical versus Coarse-to-fine Matching

There are similarities and differences between the
hierarchical approach and the coarse-to-fine control
strategy for matching discrete "interest points" used by
Moravec or for matching zero-crossings proposed by
Marr and Poggio.

The  basic  idea  is  to  limit  the  search  space  of  possible 
matches.  In  the  coarse-to-fine  approach,  initial  match¬ 
ing  of  structures  uses  a  coarse  representation  of  images 
and  the  density  of  structures  is  greatly  reduced.  The 
reduction  of  structure  density  reduces  the  search  space 
and  makes  matching  easier.  However,  this  comes  at  the 
expense  of  possible  mistakes  in  the  initial  matches.  The 
larger  the  scale,  the  more  likely  it  is  that  the  support  of 
the  zero  crossing  will  cross  boundaries  and  that  no  real 
correspondence exists. In the hierarchical approach,
we reduce the density of structures by abstraction of
higher-level symbolic structures. There is no change in
scale.

In  both  approaches,  the  initial  matches  are  used  to 
constrain  the  matching  of  finer  detailed  or  lower-level 
representations.  This  again  reduces  the  search  space 
of  the  matching  process.  The  estimated  disparities  of 
matching  at  a  coarser  level  are  used  to  constrain  the 
disparity  of  possible  matches  at  a  finer  level  in  the 
coarse-to-fine approach. The assumption of similar
disparity is valid only for surfaces varying in a smooth
manner relative to the viewer. In the hierarchical
approach, reasoning and inference are used to interpret
the structures obtained. Thus great changes in
disparities are allowed and accounted for.


3.5  Complexity  of  Matching 

We  analyze  the  complexity  of  matching  body  extents 
and  compare  it  to  previous  methods  of  area  correlation 
and  matching  edgels. 

3.5.1  Matching  Edgels 

The number of possible matches can be reduced
considerably by matching only simple structures such as
edgels. As we pointed out earlier, matching edgels
reduces the space of possible correspondences by
restricting the computation to interest points in an
image.

Let there be E edgels in an image. The number of
possible correspondences under a straightforward search
is

$$E^2. \tag{7}$$

The search space may be reduced by applying the
epipolar constraint. Let there be L epipolar lines in
each image. On average there are E/L edgels on
each epipolar line and each edgel can find E/L possible
matches in the other image. Therefore, the number of
possible matches within an epipolar plane is

$$\left(\frac{E}{L}\right)^2. \tag{8}$$

For the whole image, there are L epipolar lines. The
total number of possible matches is

$$\frac{E^2}{L}. \tag{9}$$

We compare this result to the total number of possible
matches using bodies.


3.5.2  Matching Bodies

The number of possible matches can be further reduced
by matching higher-level structures. Let there be E
edgels, C curves, S surfaces, and B bodies in an image
after segmentation, where

$$C = \frac{E}{K_e}, \tag{10}$$

$$S = \frac{C}{K_c}, \tag{11}$$

$$B = \frac{S}{K_s}, \tag{12}$$

and $K_e$, $K_c$, and $K_s$ are the average number of edgels,
curves, and surfaces forming a curve, a surface, and
a body, respectively.

Each body has a unique maximum extent. Thus the
average number of maximum extents per epipolar line
is

$$\frac{B}{L}. \tag{13}$$

Each of them has B/L possible matches in the other
image. Therefore, the number of possible matches of
maximum extents between a pair of epipolar lines is

$$\left(\frac{B}{L}\right)^2. \tag{14}$$

For the whole image, there are L epipolar lines and the
total number of possible matches is

$$\frac{B^2}{L}. \tag{15}$$

Substituting Equations 10 to 12 into Equation 15, we get

$$\frac{E^2}{L} \cdot \frac{1}{K_e^2 K_c^2 K_s^2}. \tag{16}$$

Comparing this to Equation 9, the matching complexity
is reduced by a factor of

$$K_e^2 K_c^2 K_s^2. \tag{17}$$

For a typical image, the average number of edgels
per curve is around 30, the average number of curves
per surface is around four, and the average number of
surfaces per body is around three. Thus, we have a
reduction of more than four orders of magnitude in
computation time. The reduction in computation time is
gained at the expense of segmentation and monocular
interpretation. However, the segmentation and
interpretation also provide additional meaningful results on
the recovered depth data.
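
The size of the reduction can be checked numerically. The sketch below evaluates the edgel-level and body-level match counts and their ratio for the typical averages quoted in the text (about 30 edgels per curve, four curves per surface, three surfaces per body). The function names and the values of E and L are hypothetical, chosen only for illustration.

```python
import math


def edgel_matches(E, L):
    """Possible edgel matches over the whole image with the epipolar constraint."""
    return E ** 2 / L


def body_matches(E, L, K_e, K_c, K_s):
    """Possible body matches, with B obtained from E through the three averages."""
    B = E / (K_e * K_c * K_s)
    return B ** 2 / L


def reduction_factor(K_e, K_c, K_s):
    """Ratio of edgel-level to body-level match counts."""
    return (K_e * K_c * K_s) ** 2


E, L = 36000, 480  # hypothetical edgel count and number of epipolar lines
factor = reduction_factor(30, 4, 3)
print(factor)              # 129600
print(math.log10(factor))  # a little over 5
print(edgel_matches(E, L) / body_matches(E, L, 30, 4, 3))
```

The ratio does not depend on E or L, which cancel, so the estimate of more than four orders of magnitude follows from the three averages alone.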


4  Implementation  and  Results 

4.1  Edgel  Detection  and  Curve  Fitting 

A stereo pair of gray-scale images is obtained from a
CCD camera (Figure 5). Edgels are detected by
applying Nalwa’s directional edge operator [Nalwa 1986]
to the pair of images. These edgels are first
aggregated into ordered sets corresponding to individual
extended edges. Straight lines and conic sections are
then fitted to the edgels in each segment in a best-fit
sense [Nalwa 1987]. Position and tangent continuity
are preserved at this stage (Figure 6).

4.2  Curve  Extension 

Nalwa’s  edge  operator  is  designed  to  detect  a  single 
step-edge  profile  within  each  window.  When  more  than 
one  edge  appears  within  the  same  window,  the  edge 
operator  misses  edges  and  estimates  edge  parameters 
inaccurately.  In  particular,  edges  are  not  properly  de¬ 
tected  around  junctions.  Thus,  the  curves  obtained  by 
the  curve-fitting  process  are  usually  incomplete  near 
junctions  where  three  or  more  curves  meet.  Only  a  few 
junctions  are  found  at  this  stage.  However,  the  infor¬ 
mation  of  junctions  is  critical  to  the  segmentation  of 
surfaces  and  bodies. 

We  extend  curves  near  junctions  in  order  to  estimate 
junctions.  The  tangent  of  a  fitted  curve  near  a  junction 
is  a  good  indication  of  the  orientation  of  the  nearby 
missing  edgels.  The  contrast  across  the  fitted  curve 
also  gives  a  good  estimate  of  the  contrast  across  the 
missing  edgels. 

A directional difference operator is chosen to detect
the missing edges around junctions. This operator is
applied to regions where end points of curves are not
connected to any junction. The direction and contrast
across the newly detected edgels are constrained by the
direction of the tangent of the fitted curve and the
contrast across the curve. The direction of the new edgels
must be within π/4 radians of the tangent of the
fitted curve and the contrast across the new edgel must
be at least half of the original contrast across the fitted
curve. Therefore the contrast required for the detection
of a missing edgel is dynamically adjusted according to
previously detected edgels; it is not set a priori. This
smaller operator operates on the intensity values of the
underlying gray-scale image.
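
The acceptance test for a newly detected edgel can be written down directly from the two thresholds above. This is a minimal sketch rather than the authors' code: the function names are hypothetical, directions are assumed to be angles in radians, and contrasts are compared by magnitude.

```python
import math


def angle_difference(a, b):
    """Smallest absolute difference between two angles, in radians."""
    return abs((a - b + math.pi) % (2.0 * math.pi) - math.pi)


def accept_extension_edgel(edgel_dir, edgel_contrast, tangent_dir, curve_contrast):
    """Apply the direction and contrast tests for a candidate extension edgel."""
    # Direction must lie within pi/4 of the fitted curve's tangent.
    direction_ok = angle_difference(edgel_dir, tangent_dir) <= math.pi / 4.0
    # Contrast must be at least half of the contrast across the fitted curve.
    contrast_ok = abs(edgel_contrast) >= 0.5 * abs(curve_contrast)
    return direction_ok and contrast_ok


print(accept_extension_edgel(0.5, 60.0, 0.0, 100.0))  # True
print(accept_extension_edgel(1.0, 60.0, 0.0, 100.0))  # False: direction too far off
print(accept_extension_edgel(0.1, 40.0, 0.0, 100.0))  # False: contrast too low
```

Because the contrast threshold is taken from the curve being extended, a faint curve admits faint extension edgels while a strong curve rejects them.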

Curves  with  length  less  than  a  fixed  threshold  are 
not  extended.  This  avoids  extensions  on  noisy  curves 
in the image. Figure 7 shows the results after curve
extension.

4.3  Junction  Classification 

New  junctions  formed  by  curve  extension  and  old  junc¬ 
tions  obtained  directly  from  curve  fitting  are  classified 
by  their  types  and  orders  as  described  in  Section  2. 

In  the  implementation,  because  of  noise  in  the  data, 
we  allow  a  variation  of  five  degrees  for  determining 
T  junctions  and  K  junctions  where  one  of  the  angles 
should  be  180  degrees. 
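
The classification rules of Section 2, together with the five-degree tolerance, can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: each curve is represented by the direction (in degrees) in which it leaves the junction, the function names are hypothetical, and the 180-degree test is applied to every pair of curves.

```python
def pair_angle(a, b):
    """Angle between two directions in degrees, in the range [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def classify_junction(directions_deg, tol_deg=5.0):
    """Return the junction label for the curves leaving a junction."""
    n = len(directions_deg)
    if n < 2:
        return "order-one"
    if n == 2:
        return "L"
    # True when some pair of curves forms an angle of 180 degrees, within tolerance.
    straight = any(
        180.0 - pair_angle(directions_deg[i], directions_deg[j]) <= tol_deg
        for i in range(n)
        for j in range(i + 1, n)
    )
    if n == 3:
        if straight:
            return "T"
        ordered = sorted(d % 360.0 for d in directions_deg)
        gaps = [(ordered[(k + 1) % n] - ordered[k]) % 360.0 for k in range(n)]
        return "A" if max(gaps) > 180.0 else "Y"
    return "K" if straight else "X"


print(classify_junction([0, 90]))            # L
print(classify_junction([0, 120, 240]))      # Y
print(classify_junction([0, 45, 90]))        # A
print(classify_junction([0, 178, 270]))      # T (within the tolerance)
print(classify_junction([0, 90, 180, 300]))  # K
```

The tolerance only affects the T and K tests, which matches the text: the other labels depend on inequalities rather than on an exact angle.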

4.4  Surface  Segmentation 

The  images  of  surfaces  have  two  kinds  of  boundaries: 
open  surface  boundaries  and  closed  surface  boundaries. 
An open surface boundary may occur if a surface is
occluded or if a surface has missing edges.

Tracing surface boundaries is carried out by left-wall
(or right-wall) following. For left-wall following, the
tracing starts at a junction. It follows a curve of that
junction and takes the closest counter-clockwise curve
whenever it arrives at another junction (Figure 20).
Exceptions arise when the tracing comes to a T junction
or a K junction. If the last curve that is being
traced is tangent to any other curve, then the tracing
takes the tangent curve instead of the closest
counter-clockwise curve. Also, if the closest counter-clockwise
curve is tangent to another curve, then the closest
counter-clockwise curve will not be chosen. This is based on
the assumption that curves that are tangent to each
other belong to the same surface (Figure 21).
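
The choice of the next curve at a junction can be sketched as below. This is one interpretation of the rule, not the authors' code. It assumes that each curve is given by the direction (in degrees, measured counter-clockwise) in which it leaves the junction, that "closest counter-clockwise" means the smallest counter-clockwise rotation from the curve just traced, and that tracing simply stops when the second exception applies, since the text does not say which curve is taken instead. All names are hypothetical.

```python
def ccw_angle(a, b):
    """Counter-clockwise rotation in degrees that takes direction a to b."""
    return (b - a) % 360.0


def is_tangent(a, b, tol_deg=5.0):
    """True when two curves leave the junction in opposite directions."""
    return abs(ccw_angle(a, b) - 180.0) <= tol_deg


def next_curve(back_dir, other_dirs, tol_deg=5.0):
    """Index into other_dirs of the curve to follow, or None to stop.

    back_dir is the direction, seen from the junction, of the curve just traced.
    """
    # Exception 1: continue along a curve tangent to the one just traced.
    for i, d in enumerate(other_dirs):
        if is_tangent(back_dir, d, tol_deg):
            return i
    if not other_dirs:
        return None  # junction of order one
    # Default: the closest counter-clockwise curve.
    closest = min(
        range(len(other_dirs)),
        key=lambda i: ccw_angle(back_dir, other_dirs[i]),
    )
    # Exception 2: do not take it if it is tangent to another curve.
    for j, d in enumerate(other_dirs):
        if j != closest and is_tangent(other_dirs[closest], d, tol_deg):
            return None
    return closest


print(next_curve(180, [0, 270]))   # 0: arriving along the top of a T, go straight on
print(next_curve(270, [0, 180]))   # None: arriving from the stem of a T
print(next_curve(90, [210, 330]))  # 0: ordinary Y junction
```

Arriving from the stem of a T junction yields no continuation in this sketch, which agrees with the statement that no surface is segmented in that case.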

Tracing which starts from the stem of a T junction
must end at the stem of another T junction or a
junction of order one, forming an open surface, e.g. surfaces
S2 and S4 in Figure 20. Tracing which starts from a
junction of order one can stop at another junction of
order one or at the stem of a T junction. If the tracing
starts at an A or Y junction, it stops when it comes
back to the starting junction, making a loop, e.g.
surfaces S1 and S3 in Figure 20. Thus a closed surface is
formed. If it comes to a T junction from the stem or
a junction of order one, then no surface is segmented.
Any surface boundary which ends at a stem of a T
junction or a junction of order one is taken care of by
tracings which start from the stem of a T junction. For
example, the tracing which starts at junction A and
follows curve C comes to a T junction but no surface is
formed. Instead surface S4 is formed when the tracing
starts from the stem of a T junction.

Figure 20: Tracing Surfaces

Figure 21: Tracing Surfaces with T and K Junctions

The  tracing  process  repeats  itself  for  every  curve  of 
a  junction  and  for  every  junction.  After  tracing  curves 
belonging  to  all  junctions,  there  are  still  a  few  curves 
untraced.  They  are  those  belonging  to  closed  surfaces 
that  do  not  have  any  junctions,  e.g.  a  projection  of 
a  sphere.  All  the  curves  that  have  been  traced  are 
marked  and  any  curves  that  remain  untraced  are  then 
traced. Surfaces that do not enclose more than 180
degrees are not considered. This eliminates noisy data
such as some isolated curves in the center of the larger
sphere in Figure 7. The results of surface segmentation
on Figure 7 are shown in Figure 8. Each surface is
offset for clarity.

4.5  Body  Segmentation 

A body is a connected component of surfaces in three-dimensional
space. Surfaces for which there is evidence of coincidence
are grouped into bodies. A surface which occludes another surface
is not coincident with it across the occluding boundary,
although they may be coincident along another
boundary. Imperfect surface segmentation can lead to
grouping of more than one object into a body. Bodies
form the highest level of segmentation in the hierarchical
structure.
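Grouping surfaces into connected components can be illustrated with a small union-find sketch. It assumes that the coincidence evidence has already been reduced to a list of surface index pairs; the name `group_bodies` and this input format are illustrative assumptions, not the system's actual data structures.

```python
def group_bodies(num_surfaces, coincident_pairs):
    """Group surfaces 0..num_surfaces-1 into bodies (connected components).

    coincident_pairs: iterable of (i, j) pairs for which there is evidence
        that surfaces i and j are coincident (hypothetical input format).
    Returns a sorted list of bodies, each a sorted list of surface indices.
    """
    parent = list(range(num_surfaces))

    def find(i):
        # Follow parent links to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in coincident_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    bodies = {}
    for s in range(num_surfaces):
        bodies.setdefault(find(s), []).append(s)
    return sorted(bodies.values())
```

A pair that is separated by an occluding boundary is simply never listed, so the two surfaces stay in different bodies unless another boundary links them.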

Results from grouping surfaces are shown in Figure 9.
The small sphere at the back and the truncated pyramid
are not joined with the objects in front because occlusion
is inferred from T junctions. Again, the bodies
are offset for clarity.

4.6  Matching  Results 

Bodies, surfaces, curves, junctions, and edgels extracted
from the stereo pair of images are then used
for matching as described in Section 3. The projection
of the three-dimensional depth data is shown in Figure 10.

It  is  important  to  note  that  the  range  data  recov¬ 
ered  is  already  segmented  into  three-dimensional  sur¬ 
face  boundaries  and  body  boundaries.  For  each  point 
of  depth  data  recovered,  we  know  that  it  belongs  to  a 
specific  curve  of  a  specific  surface  of  a  specific  body  in 
space.  There  are  symbolic  representations  attached  to 
the  recovered  depth  data.  Using  these  segmented  data, 
it  is  easy  to  perform  surface  interpolation. 

In  previous  approaches,  where  edges  are  matched, 
the  recovered  three-dimensional  raw  data  are  isolated 
data  with  no  meaning  attached  to  them.  Additional 
segmentation  efforts  in  three-dimensional  space  are 
necessary  for  extracting  symbolic  information  from  the 
raw  range  data.  Fitting  surfaces  to  neighboring  edges 
may  not  be  meaningful,  since  the  edges  may  not  belong 
to  the  same  surface. 

4.6.1 Additional Results

Results from another set of images are shown in
Figures 22 through 27.

4.6.2 High Resolution Results

Under the current implementation, the vertical position
(with respect to the image plane) of a detected edgel
is quantized to the nearest epipolar line. This results
in a simpler and faster matching process. However, accuracy
is lost and some of the recovered depth data
have jagged boundaries. This can be seen in the octagonal
block on the lower right of Figure 27. We show a
close-up view of the octagonal surface in Figure 28.

Figure 22: Another Pair of Gray-scale Images (Set 2)

To  recover  more  accurate  results,  one  can  do  an  in¬ 
terpolation  on  the  detected  edgels  or  increase  the  res¬ 
olution  of  edgel  position  by  increasing  the  number  of 
epipolar  lines.  We  choose  to  study  the  latter  method. 

We increase the resolution of the detected edgels
around the octagonal block by a factor of five, i.e. the
number of epipolar lines spanned by the octagonal surface
is increased fivefold. The matching result is shown
in Figure 29. It can be seen that the boundary of the
octagonal surface is as smooth as the boundary in the
original gray-scale image. This indicates that the recovered
depth data are accurate.

Another example is shown using the sphere in Figure 27.
The close-up view of the sphere is shown in
Figure 30.

Again, we increase the resolution of the detected
edgels locally around the sphere by a factor of five. The
matching result is in Figure 31.

These  results  indicate  that  the  jagged  boundaries  in 
the  projection  of  the  results  are  due  to  the  quantization 
errors  introduced  in  the  implementation  and  not  the 
segmentation  or  matching  processes. 
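The effect of epipolar-line quantization can be illustrated with a small numerical sketch. The edgel positions, the line spacings, and the function names below are assumptions made for illustration only; they are not taken from the implementation described above.

```python
import math

def quantize_to_epipolar_line(y, line_spacing):
    """Snap a vertical edgel position to the nearest epipolar line."""
    return math.floor(y / line_spacing + 0.5) * line_spacing

def max_quantization_error(positions, line_spacing):
    """Largest vertical displacement introduced by the quantization."""
    return max(abs(y - quantize_to_epipolar_line(y, line_spacing))
               for y in positions)

# Hypothetical sub-pixel vertical positions of detected edgels.
positions = [i * 0.013 for i in range(1000)]

coarse = max_quantization_error(positions, 1.0)  # one line per image row
fine = max_quantization_error(positions, 0.2)    # five times as many lines
```

The worst-case displacement is half the line spacing, so a fivefold increase in the number of epipolar lines reduces the bound from 0.5 to 0.1 of a row in this sketch.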


5  Conclusion  and  Discussion 


The system performs a high-level stereo correspondence
which achieves global correspondence by means of high-level
structures: surfaces and bodies. The main domains
of application are images in which there are
monocular structures. The performance of the system
depends on its ability to extract these structures for
matching. It utilizes geometric constraints, interpretation
of occlusion and coincidence, and knowledge of
surfaces and bodies.

The system is not designed for random-dot stereograms,
where no monocular structure can be found in
the images. Its performance is also limited on natural
scenery, where the structures of objects are not
well-defined. We have tried our system on a pair of
aerial photos of an urban area and experienced problems
during the segmentation stage. Regions in natural
scenery usually do not have well-defined structures, and
that makes segmentation difficult. A leaf may have a
well-defined boundary, but a tree does not. The system is
not able to segment meaningful structures from the
detected edgels for matching. Textured surfaces also
create problems for segmentation. The edge operator
tends to detect edgels on a textured surface, but the
curve fitting process cannot distinguish between edgels
forming the boundaries of a surface and others that
arise from surface texture.

We showed that by segmentation and monocular interpretation,
we are able to match surfaces and bodies.
This results in reduction of computational complexity
and avoidance of local mismatches. Since the structures
for matching are segmented, the recovered three-dimensional
depth data are also segmented. This is
very useful for surface interpretation. In contrast, previous
approaches use sparse edgels for correspondence,
and the segmentation effort concentrates on the recovered
sparse depth data.

Figure 25: Segmented Surfaces (Set 2)

Figure 29: Recovered Octagonal Surface from High-resolution Data

Figure 30: Close-up View of the Sphere

Figure 31: Recovered Sphere from High-resolution Data


Abstract


An  inherent  problem  in  stereo  correspondence  is  that 
apparent  boundaries  of  a  curved  surface  from  a  pair 
of  stereo  images  do  not  correspond  to  the  same  phys¬ 
ical  points  on  the  curved  surface;  therefore,  they  can¬ 
not  be  matched  directly.  In  this  paper,  we  describe  a 
high-level  stereo  vision  system  which  replaces  the  inap¬ 
propriate  correspondence  constraint  with  an  accurate 
constraint  that  the  curved  surface  is  tangent  to  lines  of 
sight  along  four  rays  on  each  epipolar  plane.  We  also 
derive  equations  from  apparent  boundaries  of  a  curved 
surface  in  both  images  in  order  to  reconstruct  an  ap¬ 
proximation  to  the  curved  surface  in  space.  The  termi¬ 
nator,  an  edge  bounding  the  curved  surface,  is  used  as 
an  additional  constraint  whenever  it  is  available.  This 
is  particularly  applicable  to  objects  that  can  be  de¬ 
scribed  in  terms  of  generalized  cylinders.  Surface  maps 
are  obtained  from  reconstructed  curved  surfaces;  thus, 
surface  interpolation  is  not  necessary. 


1  Introduction 


Apparent boundaries of curved surfaces appear where
surface normals are perpendicular to the line of sight,
i.e. where the surface is tangent to the line of sight. We
refer to these apparent boundaries as limbs (Figure 1).
Limbs are distinct from "true" edges of surfaces, which
correspond to a discontinuity of surface normal at the
intersection of surfaces. For a limb, the curve in space
which projects to the image curve depends on viewpoint.
On the other hand, true edges are viewpoint-independent.
At a limb, the surface has a continuous tangent.
Limbs are also distinguished from edges which are
discontinuities in surface reflectivity. Limbs coincide with
pigment boundaries, which are viewpoint-independent,
only from a constrained viewpoint, i.e., only on a set
of measure zero [Binford 1987].

Figure 1: Stereo Views of a Curved Surface

In Figure 1, the boundaries of a cylinder as seen in
the left image correspond to curves A and C on the surface
of the cylinder, while the boundaries of the cylinder
as seen in the right image correspond to curves B and
D. Thus, the image curves of corresponding limbs in a
pair of images do not correspond to the same physical curve
in the world; therefore we cannot recover the three-dimensional
depth data by simple triangulation. It is
frequently proposed to take the intersection of the tangents
as the "edge". But there are only two of those,
which allows only a planar approximation. By using
accurate constraints we obtain tangent and curvature
information which cannot be obtained from a planar
approximation.

A terminator of a curved surface is an edge bounding
the curved surface. In Figure 1 the terminator is an
ellipse.

Figure 2: Planar Cross-section of a Curved Surface


We design and implement a solution for reconstructing
objects with curved surfaces. Curved surface reconstruction
has been treated very little [Shapira 1978].
The problem is that corresponding limbs extracted
from stereo images are not the same points on a
curved surface, as described above, and they cannot be
matched directly. In the next section, we show the
errors that are introduced by fitting a planar surface
to incorrect depth data instead of fitting a curved surface
using physical constraints derived from the camera
geometry. In Section 3, we show that the lines of
sight from stereo views can restrict the recovered cross-sections
of a curved surface to one degree of freedom.
Additional constraints are employed to determine the
cross-sections uniquely. Each curved surface is then
reconstructed by a number of planar conics. Results of
implementation are shown in Section 4.


2 Errors between Fitting Planar and Curved Surfaces

Consider  an  epipolar  plane  cutting  a  curved  surface 
to  form  a  planar  cross-section.  This  cross-section  is 
tangent  to  the  four  lines  of  sight,  two  from  the  left 
view  and  two  from  the  right  view  (Figure  2). 

There are two types of error in fitting a planar surface
instead of a curved surface to image data (Figure 3):
the error between the true and apparent boundaries of
the reconstructed surfaces, and the error between the
true and apparent depths of the reconstructed surfaces.

Figure 3: Errors of Fitting Planar Surface

2.1  Boundary  Error 

Consider  part  of  a  cross-section  of  a  curved  surface  in 
Figure  4.  The  lines  of  sight  from  left  and  right  views 
are  tangent  to  the  cross-section.  If  we  consider  the 
intersection  of  two  corresponding  lines  of  sight  to  be 
from  the  same  physical  point,  the  apparent  boundary 
of  the  cross-section  is  at  A.  However,  the  true  boundary 
of  the  cross-section  is  at  B  instead. 

The error between the true boundary and the apparent
boundary is shown in [Arnold 1981] to be equal
to

$r\,(\sec(\alpha/2) - 1)$,  (1)

where $r$ is the radius of curvature of the cross-section
and $\alpha$ is the angle between the two lines of sight. It is
interesting to note that the wider the angle between the
two viewpoints, the larger the error. This is in contrast
to random error in triangulation, which is smaller when
the angle between the viewpoints becomes wider. In
this case, the error is systematic.
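The boundary error can be checked numerically with a short sketch. It assumes a circular cross-section of radius r and two lines of sight that touch it at contact points separated by a central angle equal to the angle between the lines; the function names are illustrative.

```python
import math

def boundary_error(r, alpha):
    """Boundary error r * (sec(alpha / 2) - 1) for a circular cross-section."""
    return r * (1.0 / math.cos(alpha / 2.0) - 1.0)

def tangent_intersection_distance(r, alpha):
    """Distance from the circle centre to the point where the two tangent
    lines of sight intersect (the apparent boundary). Requires alpha > 0."""
    # Contact points at angles +alpha/2 and -alpha/2 on a circle centred at
    # the origin. The tangent at angle t is x*cos(t) + y*sin(t) = r.
    t1, t2 = alpha / 2.0, -alpha / 2.0
    a1, b1 = math.cos(t1), math.sin(t1)
    a2, b2 = math.cos(t2), math.sin(t2)
    det = a1 * b2 - a2 * b1
    # Cramer's rule for the 2x2 system with right-hand side (r, r).
    x = (r * b2 - r * b1) / det
    y = (a1 * r - a2 * r) / det
    return math.hypot(x, y)
```

The apparent boundary lies at distance r sec(α/2) from the centre, so its offset from the circle itself is the quantity returned by `boundary_error`, which grows as the angle widens.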

2.2  Depth  Error 

Consider  the  error  in  depth  between  a  curved  surface 
and  its  planar  approximation.  For  simplicity  of  calcu¬ 
lation,  we  assume  that  the  cross-section  of  the  curved 
surface  is  a  circle  and  that  its  center  is  located  at  a 
distance  d  from  both  cameras  (Figure  5).  The  dis¬ 
tance  between  the  apparent  boundary,  A ,  and  the  true 
boundary,  B,  can  be  shown  to  be 

r(1-rJeCjl]),  (2) 




and  thus  the  depth  error  is  equal  to 


—  !  =  r(  1-r 


4ec(f), 


3 Curved Surface Reconstruction

3.1 Fitting Conics to Four Lines of Sight

To reconstruct a curved surface from a pair of stereo
images, we first cut the curved surface into a number
of slices, each slice in an epipolar plane. Within
each epipolar plane, there are four lines of sight, two
from each viewpoint. Each line of sight is tangent to
the cross-section of the curved surface. We fit a conic
tangent to these four lines of sight in each epipolar
plane. Combining all the reconstructed conics in the epipolar
planes gives the curved surface.

We show that the equation of a conic tangent to four
given lines is given by

$L^2 E^2 - 2L(AC + BD) + F^2 = 0$,  (16)

(this abridged notation is explained in the proof) where
A, B, C, and D are the four tangent lines, E and F are
the diagonals of the quadrilateral formed by the four
tangent lines,

$AC - BD = EF$,  (17)

and L is a parameter that can be chosen freely.

Proof: 

We adopt the abridged notation in [Salmon 1904],
where an equation of a conic

$ax^2 + bxy + cy^2 + dx + ey + f = 0$  (18)

is represented by

$S = 0$,

i.e. S stands for $ax^2 + bxy + cy^2 + dx + ey + f$. Similarly,
an equation of a straight line

$ax + by + c = 0$

is represented by

$A = 0$,

i.e. A stands for $ax + by + c$. Let S = 0, T = 0, and
V = 0 be conics, A = 0, B = 0, C = 0, D = 0, E = 0,
and F = 0 be straight lines, and K and L be constants.

We know that if U = 0 and V = 0 are the equations
of any two loci, then the locus represented by the
equation U + kV = 0 (where k is any constant) passes
through every point common to the two given loci. Assume
that A and B intersect S at four points m, n, o,
and p, as illustrated in Figure 6. It follows that a
conic T passing through these four intersection points
is given by

$T = S - KAB$,  (22)

(Figure 7).

Figure 6: Lines A and B Intersect Conic S at Four Points

Suppose that lines A and B get closer and closer to
each other until they coincide; then point m coincides
with point n, and point o coincides with point p. The
conic T now touches the first conic S at m and o (Figure 8).
Thus, Equation 22 becomes

$T = S - KA^2$,  (23)

which represents a conic tangent to S at two distinct
points. A is the chord of contacts.

Next, we want to find an equation of a conic tangent
to two given conics S and T. Assume that S and T
intersect at four points m, n, o, and p. E and F are
their chords of intersection (Figure 9), i.e.

$T = S - EF$.  (24)

From Equation 23, a conic tangent to S can be represented
by

$4LS - (LE + F)^2 = 0$,  (25)

where LE + F is the chord of contacts between the new
conic and conic S, and L is any constant. Similarly, a
conic tangent to T can be represented by

$4LT - (LE - F)^2 = 0$,  (26)

where LE - F is the chord of contacts between the
new conic and conic T. Adding Equation 25 and
Equation 26, we have

$L^2 E^2 - 2L(S + T) + F^2 = 0$,  (27)

which represents a conic tangent to both S and T (Figure 9).

Figure 7: Conic T Passes through Four Points of Intersection

When the two conics S and T degenerate into two
pairs of straight lines A, C and B, D, we have

$S = AC$  and  (28)

$T = BD$,  (29)

then Equation 27 and Equation 24 become

$L^2 E^2 - 2L(AC + BD) + F^2 = 0$,  where  (30)

$AC - BD = EF$.  (31)
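The result can be checked numerically on a simple case. The sketch below assumes the four tangent lines are the sides of the unit square, with the line equations scaled so that AC - BD = EF holds exactly; the function names and the coefficient-tuple representation are illustrative choices, not part of the original derivation.

```python
def line_product(p, q):
    """Coefficients (x^2, xy, y^2, x, y, 1) of the product of two lines,
    each given as (a, b, c) for a*x + b*y + c."""
    a1, b1, c1 = p
    a2, b2, c2 = q
    return (a1 * a2, a1 * b2 + b1 * a2, b1 * b2,
            a1 * c2 + c1 * a2, b1 * c2 + c1 * b2, c1 * c2)

def tangent_conic(A, B, C, D, E, F, L):
    """Coefficients of L^2 E^2 - 2L(AC + BD) + F^2."""
    terms = [(L * L, line_product(E, E)), (-2 * L, line_product(A, C)),
             (-2 * L, line_product(B, D)), (1, line_product(F, F))]
    return tuple(sum(w * k[i] for w, k in terms) for i in range(6))

def evaluate(conic, x, y):
    p, q, r, s, t, u = conic
    return p * x * x + q * x * y + r * y * y + s * x + t * y + u

def discriminant_on_line(conic, line):
    """Discriminant of the conic restricted to the line. It vanishes when
    the restriction is a perfect square, i.e. the line touches the conic."""
    a, b, c = line
    n = a * a + b * b
    x0, y0 = -a * c / n, -b * c / n   # a point on the line
    dx, dy = -b, a                    # direction of the line
    g0 = evaluate(conic, x0, y0)
    gp = evaluate(conic, x0 + dx, y0 + dy)
    gm = evaluate(conic, x0 - dx, y0 - dy)
    qa = (gp + gm - 2 * g0) / 2       # coefficient of t^2
    qb = (gp - gm) / 2                # coefficient of t
    return qb * qb - 4 * qa * g0

# Sides of the unit square: A: y = 0, C: y - 1 = 0, B: x = 0, D: x - 1 = 0.
A, C, B, D = (0, 1, 0), (0, 1, -1), (1, 0, 0), (1, 0, -1)
# Diagonals, since y(y - 1) - x(x - 1) = (y - x)(x + y - 1).
E, F = (-1, 1, 0), (1, 1, -1)

inscribed = tangent_conic(A, B, C, D, E, F, -1)  # 4x^2 + 4y^2 - 4x - 4y + 1
```

With L = -1 the conic is the circle inscribed in the square; other values of L give other members of the one-parameter family, each of which touches all four lines.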

3.2 Curved Surface Reconstruction with Extremum Constraint

From  Equation  30,  we  note  that  there  is  a  free  pa¬ 
rameter  L  which  can  assume  any  value.  This  is  to  be 
expected  as  a  general  conic  is  defined  by  five  param¬ 
eters  and  we  have  only  four  constraints,  namely,  the 
four  tangent  lines  from  the  four  lines  of  sight. 

To determine the fifth parameter, without additional
information, we employ the extremum principle suggested
by Brady and Yuille [Brady 1983]. It maximizes
a familiar measure of compactness or symmetry of surface,
namely the ratio of the area to the square of the
perimeter. Specifically, the measure is

$M = \dfrac{\text{Area}}{\text{Perimeter}^2}$.  (32)

This measure defines characteristics of a curve which
are independent of the scale, and for all possible curves
it is maximized by the most symmetric one, a circle.
The measure has an upper bound of $1/(4\pi)$ when the curve
is a circle. Its lower bound, zero, is achieved when the
curve becomes a straight line.

Figure 8: Conic T Tangent to Conic S
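The behaviour of the compactness measure can be illustrated for ellipses. The sketch below is an assumption-laden illustration: it uses Ramanujan's approximation for the ellipse perimeter (which is exact for a circle) rather than whatever perimeter computation the implementation used, and the function name is hypothetical.

```python
import math

def ellipse_compactness(a, b):
    """Area / Perimeter^2 for an ellipse with semi-axes a and b."""
    area = math.pi * a * b
    # Ramanujan's approximation to the perimeter; exact when a == b.
    perimeter = math.pi * (3 * (a + b) - math.sqrt((3 * a + b) * (a + 3 * b)))
    return area / perimeter ** 2
```

For a circle of any radius the value is 1/(4π), and it decreases as the ellipse becomes more elongated, which is why maximizing it drives each fitted conic toward a circle.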

3.3 Curved Surface Reconstruction with Terminator Constraint

To determine the fifth parameter, we may use other
information from the image, such as specularity, markings,
and texture on a curved surface, or the terminator
of a curved object. Specularity provides viewpoint-dependent
information while surface markings, surface
texture, and the terminator render viewpoint-independent
information. Recall that the terminator
is the intersection of the surface with another, as in
Figure 1.

For linear homogeneous generalized cylinders, we can
use  the  terminator  to  constrain  the  fifth  parameter.  In 
particular,  if  the  terminator  is  a  conic,  the  fifth  con¬ 
straint  may  be  the  ratio  of  the  major  axis  to  the  minor 
axis  of  the  terminator.  The  ratio  is  kept  constant  for 
each  of  the  reconstructed  conics  while  the  size  of  the 
cross-section  may  vary. 

Surface markings and surface texture may provide
additional constraints to each of the reconstructed conics.
Within each cross-section of the curved surface, if
a pair of corresponding surface markings can be identified,
the three-dimensional depth of that marking may
be determined. Each provides an additional constraint
for uniquely determining the conics.

Figure 9: A Conic Tangent to Two Conics S and T


4  Results  of  Implementation 

In this section, we provide examples of reconstruction
of curved surfaces. As discussed in Section 3, we find a
best-fit conic in every epipolar plane that cuts a curved
surface. By combining these fitted conics, we reconstruct
the curved surface. When limbs are the only
information available, we use the extremum constraint,
which maximizes the ratio of the area to the square of
the perimeter of a conic. When additional information such
as the terminator is available, we may use the ratio of
the major axis to the minor axis of the terminator as
an additional constraint.

In the current implementation, curved surfaces are
segmented by hand rather than automatically, although the
segmentation could have been automated. We can distinguish
a curved surface from a planar surface provided
we can identify the junctions correctly. In Figure 10, if
we can distinguish between junction a and junction A,
or junction I and junction L, then we can identify the
curved surface. In practice, it may be difficult to
make the distinction because of noisy data.


Figure 10: Distinctions between Curved and Planar
Surfaces (panels: Cylinder, Rectangular Block)


4.1  Extremum  Constraint 

Figure  12  provides  a  close-up  look  at  a  pair  of  corre¬ 
sponding  cylinders  appearing  at  the  top  right  corner 
of  the  left  and  right  gray-scale  images  in  Figure  11. 
The  corresponding  segmented  results  are  shown  in  Fig¬ 
ure  13.  The  segmentation  process  from  a  gray-scale 
image  is  described  in  [Lim  1987]. 

Within  each  epipolar  plane,  we  fit  a  conic  to  the 
four  lines  of  sight  and  apply  the  extremum  principle 
to  constrain  the  fifth  parameter.  When  all  the  fitted 
conics  are  stacked  together,  a  curved  surface  is  recov¬ 
ered  which  is  shown  in  Figure  14.  Projection  of  the 
same  curved  surfaces  from  a  higher  angle  is  shown  in 
Figure  15. 

The  limbs  of  the  reconstructed  cylinder  appear  ir¬ 
regular.  The  dimensions  of  the  reconstructed  conics 
are  very  sensitive  to  the  data  of  the  limbs  in  the  stereo 
images.  The  quantization  errors  of  the  limbs  in  the  im¬ 
ages  contribute  to  the  quantum  jumps  of  the  limbs  in 
space. 

Figure  16  is  a  close-up  view  of  a  stereo  pair  of 
matched  spheres  appearing  at  the  center  left  of  both 
images  in  Figure  11.  The  segmented  results  appear  in 
Figure  17. 

Again the extremum principle is applied to constrain
the fifth parameter. The recovered conics are constrained
to be as circular as possible. The results from
different viewpoints are shown in Figures 18 and 19.

4.2 Terminator Constraint


When the terminator is available, an additional constraint
can be derived from it. From a pair of matched
terminators, we calculate the depth of every pair of corresponding
edgels belonging to the terminator. A set
of three-dimensional depth data is obtained. The data
are fitted with a planar surface and then a conic. The
ratio of the major axis to the minor axis is calculated
and used as an additional constraint.

Figure  21  shows  a  pair  of  cylinders  with  observable 
terminators  which  appeared  in  Figure  20. 

Results  from  fitting  conics  to  each  cross-section  of  the 
curved  surface  and  stacking  them  together  are  shown 
in  Figure  22.  The  curved  surface  in  Figure  23  is  viewed 
from  a  different  angle. 

Figure 24 shows the result of fitting conics to the
same data using the extremum constraint instead of
the terminator constraint. It is observed that the limbs
in Figure 23 appear to be more regular and less susceptible
to the quantization errors of the limbs in the
images. The extremum constraint maximizes the ratio
of area to the square of perimeter for each conic independently,
whereas the terminator constraint requires
all the conics to have the same ratio of major axis to
minor axis. Thus the terminator provides a more global
constraint which is less susceptible
to noise.


5  Conclusion  and  Discussion 

We  design  and  implement  a  solution  for  reconstructing 
objects  with  curved  surfaces.  This  is  particularly  ap¬ 
plicable  to  objects  whose  cross-sections  are  conics  and 
which  can  be  described  by  straight  homogeneous  gener¬ 
alized  cylinders.  This  includes  a  large  set  of  man-made 
curved  objects.  The  recovered  curved  surface  also  pro¬ 
vides  a  depth  map  and  thus  surface  interpolation  is  no 
longer  necessary.  When  the  cross-section  of  a  curved 
object  is  not  a  conic,  the  terminator  of  the  object  can 
provide  information  for  interpolating  the  cross-section 
of  other  parts  of  the  object. 

Other  sources  of  information  may  also  give  additional 
constraints,  e.g.  specularity,  surface  markings,  and  sur¬ 
face  texture  can  be  used  to  obtain  the  fifth  parame¬ 
ter  required.  By  matching  surface  markings  or  surface 
texture,  we  can  deduce  the  range  data  of  points  on  a 
curved  surface.  When  these  points  are  coplanar  to  the 
conic  being  reconstructed,  they  provide  the  fifth  con¬ 
straint. 

A more global approach for curved-surface reconstruction
is to fit one curved surface to all the limbs
belonging to the same surface. This will reduce some
of the noise problems we encountered. A quadric surface,
in general, requires nine parameters. Finding a
closed form of a quadric surface in terms of the tangent
requirements from the limbs remains open for research.

Figure 11: A Pair of Gray-scale Images (Set 2)

Figure 13: Segmented Curved Surfaces (Cylinder)

Figure 14: Reconstructed Curved Surface (Cylinder)

Figure 15: Reconstructed Curved Surface from a Dif-

Figure 16: A Pair of Curved Surfaces (Sphere)


Abstract

Geometric  camera  calibration  is  the  process  of  determining  a  map¬ 
ping  between  points  in  world  coordinates  and  the  corresponding  image 
locations  of  the  points.  In  previous  methods,  calibration  typically  in¬ 
volved  the  iterative  solution  to  a  system  of  non-linear  equations.  We 
present  a  method  for  performing  camera  calibration  that  provides  a 
complete,  accurate  solution,  using  only  linear  systems  of  equations. 
By  using  two  calibration  planes,  a  line-of-sight  vector  is  defined  for 
each  pixel  in  the  image.  The  effective  focal  point  of  a  camera  can  be 
obtained  by  solving  the  system  that  defines  the  intersection  point  of 
the  line-of-sight  vectors.  Once  the  focal  point  has  been  determined,  a 
complete  camera  model  can  be  obtained  with  a  straightforward  least 
squares procedure. This method of geometric camera calibration has
the advantages of being accurate, efficient, and practical for a wide
variety of applications.


1  Introduction 

Many  problems  in  computer  vision  and  graphics  require  mapping  points 
in  space  to  corresponding  points  in  an  image.  In  computer  graphics,  for 
example,  an  object  model  is  defined  with  respect  to  a  world  coordinate 
system.  To  generate  an  image,  the  points  that  lie  on  the  visible  surfaces  of 
the  object  must  be  mapped  onto  the  image  plane;  that  is,  3d  world  points 
must  be  mapped  onto  2d  image  points.  In  computer  vision,  the  image 
locations  of  points  on  an  object  can  be  used  to  infer  three-dimensional 
properties  of  the  object;  in  this  case,  2d  image  points  must  be  mapped  back 
onto  the  original  3d  world  points.  In  both  cases,  the  mapping  between 
3d  world  coordinates  and  2d  image  coordinates  must  be  known.  Geometric 
camera  calibration  is  the  process  of  determining  the  2d-3d  mapping  between 
a  camera  and  a  world  coordinate  system. 

We decompose the general problem of geometric camera calibration into
two subproblems:

• The projection problem: given the location of a point in space, predict
its location in the image; that is, project the point into the image.

•  The  back-projection  problem:  given  a  pixel  in  the  image,  compute  the 
line-of-sight  vector  through  the  pixel;  that  is,  back-project  the  pixel 
into  the  world. 

A complete solution to the camera calibration problem entails deriving a
model for the camera geometry that permits the solution of both the projection
and the back-projection problems. For many applications, a complete
solution is necessary. Some examples from the domain of mobile robots
will help illustrate the problems.

In the CMU Navlab project [6], a robot vehicle follows roads using data
from a color TV camera. In each image, the road is extracted, and the
centerline and direction of the road are computed in image coordinates. These
parameters are then back-projected into vehicle coordinates and used to plot
a course for the vehicle that stays within the road boundaries.

Turk et al. [8] describe a similar road following technique for the Autonomous
Land Vehicle (ALV). Rather than parameterizing the road in terms
of centerline and direction, they describe the road boundaries as a sequence
of points. The line-of-sight vectors for each of the boundary points are
computed by back-projection. The intersections of the line-of-sight vectors
with the ground plane yield the points in the world between which the robot
must steer to stay on the road. In addition, the ALV needs to know the predicted
position of the road in each image. This prediction is obtained by
projecting the location of the road into each image, based on the position
of the road in the previous image and the motion of the vehicle between
images.


Figure 1: The Pinhole Camera Model

The simplest model for camera geometry is the pinhole, or perspective,
model. See figure 1. Light rays from in front of the camera converge at the
pinhole and are projected onto the image plane at the back of the camera.
To avoid dealing with a reversed image, the image plane is often considered
to lie in front of the camera, at the same distance from the pinhole. The
distance from the focal point to the image plane is the focal length.

A perfect lens can be modeled as a pinhole. No lens is perfect, of
course, so part of the problem of geometric camera calibration is correcting
for lens distortions. The most accurate and conceptually simple method of
camera calibration would be to measure calibration parameters at each pixel
in the image. For example, at each pixel measure the line-of-sight vector.
This would produce a gigantic lookup table. Then, given a pixel in an
image, simple indexing would yield the line-of-sight vector that solves the
back-projection problem. To solve the projection problem, the table would
be searched to find the line-of-sight vector that passes nearest the point in
question.

A lookup table of calibration data for each pixel would be prohibitively
expensive. The obvious compromise is to sample the image, and interpolate
between data points. If the error in interpolation is less than the
measurement error, no accuracy is lost. Most approaches to geometric camera
calibration involve sampling the image, and solving for the parameters of
the interpolation functions. The obvious differences between approaches are
in the form of the interpolation functions, and the mathematical techniques
used to solve for the parameters. The main intent of most calibration work
has been the solution of the back-projection problem. The projection
problem has occasionally been overlooked. The following paragraphs briefly
describe past work.

• Sobel [5] introduced a method for calibration that involved the
solution of a large system of non-linear equations. In addition to solving
for the intrinsic camera parameters, his method also solved for
extrinsic camera parameters, such as camera pan and tilt. Sobel used the
basic pinhole model, and solved the system using a non-linear
optimization method. He did not model lens distortions, and the system
depended on the user to provide initial parameters for the optimization
technique.

Tsai [7] improved on the general non-linear approach in several ways.
He modeled distortions globally using fourth order functions, and
presented a method for computing good initial parameters for the
optimization technique. Tsai's model of lens distortions assumes that
the distortions are radially symmetric.

• Yakimovsky and Cunningham [9] presented a calibration technique
that also used a pinhole model for the camera. They treated some
combinations of parameters as single variables in order to formulate
the problem as a system of linear equations. However, in this
formulation, the variables are not completely linearly independent, yet are
treated as such. No lens distortions are modeled with this approach.

• Martins, Birk, and Kelley [3] reported a calibration technique that
does not utilize an explicit camera model. Their two-plane calibration
method consisted of measuring the calibration data for various
pixels across the image. The data for other pixels is computed by
interpolation. The back-projection problem is solved by computing the
vector that passes through the interpolated points on each calibration
plane. The interpolation can be either local or global. The two-plane
method solves only the back-projection problem.

Isaguirre, Pu, and Summers [2] extended the two-plane method to
include calibration as a function of the position and orientation of the
camera. They used an iterative approach based on Kalman filters to
obtain the solution.

Our  goal  in  camera  calibration  was  to  develop  a  single,  basic  calibration 
procedure  to  solve  both  the  projection  and  back-projection  problems  for 
a  variety  of  applications.  Consequently,  the  desired  procedure  had  to  be 
conceptually  straightforward,  easily  extended  to  obtain  various  degrees  of 
accuracy,  and  computationally  efficient.  To  meet  these  requirements,  we 
chose  to  begin  with  the  two-plane  method  of  Martins,  Birk,  and  Kelley. 
Section 2 discusses the two-plane method and the solution to the
back-projection problem. This method can be made arbitrarily accurate; the
only problem is that it fails to solve the projection problem. In section 3
we  present  a  method  for  solving  the  projection  problem  that  utilizes  the 
calibration  data  from  the  two-plane  method.  The  solution  to  the  projection 
problem  is  a  simple  application  of  analytic  geometry,  and  is  completely 
formulated  with  systems  of  linear  equations. 

The  calibration  method  presented  in  this  paper  has  been  implemented 
and  tested  in  the  Calibrated  Imaging  Laboratory  at  CMU  [4].  Results  are 
presented  in  section  4  that  demonstrate  the  accuracy  of  this  method. 

2  The  Solution  to  the  Back-Projection  Problem 

Martins, Birk, and Kelley [3] first formally presented the two-plane
calibration technique for solving the back-projection problem. This technique
has the advantage that it provides exactly the information needed (the ray in
space that defines the line of sight of a given pixel) without any explicit
camera model.

Figure 2 illustrates the concept of two-plane calibration. Let $P_1$ and $P_2$
denote the calibration planes. Assume that the 3d locations of the calibration
points on each plane are measured. An image of each plane is acquired,
and the image location of each of the calibration points is extracted. Let
the calibration points be denoted $p_{ij}$, and the corresponding image locations
be denoted $q_{ij}$, where $i = 1, 2$ is the plane, and $j = 1, 2, \ldots, n$ is the point
index. Thus, the image of $p_{ij}$ is $q_{ij}$.

Let $\rho$ and $\gamma$ denote the row and column coordinates, respectively, of an
image. Then, given a point $v = [\,\rho \;\; \gamma\,]^T$ in the image, the line-of-sight
vector for $v$ can be computed as follows. First, use the points $p_{1j}$ and $q_{1j}$
to interpolate the location of $v$ on the first calibration plane, $P_1$. Call this
point $u_1$. Then, interpolate to find the location of $v$ on the second calibration
plane. Call this point $u_2$. The pixel line-of-sight vector then has direction
$u_1 - u_2$ and passes through the point $u_1$.

Various  types  of  interpolation  can  be  used,  with  different  degrees  of 
accuracy.  Martins,  et  al  report  three  types  of  interpolation:  linear,  quadratic, 
and  linear  spline.  The  two-plane  method  has  the  potential  for  being  the  most 
accurate  of  any  calibration  method  for  the  solution  of  the  back-projection 
problem.  At  the  limit,  this  technique  consists  of  measuring  the  line-of-sight 
vectors  for  each  pixel  in  the  image.  As  will  be  seen  in  Section  4,  the 
number  of  calibration  points  used  has  a  strong  influence  on  the  accuracy  of 
the  calibration. 


Figure 2: Two-Plane Calibration (showing the image plane, calibration
grid 1, and calibration grid 2)

2.1  Global  Interpolation 

One approach to interpolating the calibration data is to globally fit an
interpolation function to the data. This function is then used for any pixel
across  the  entire  image.  Global  interpolation  has  the  effect  of  averaging 
errors  over  all  the  pixels  so  that  the  resultant  line-of-sight  vector  is  exact 
for  no  pixel,  but  is  close  for  all  pixels.  This  has  the  advantage  of  reducing 
the  sensitivity  to  errors  or  noise  in  measurements.  On  the  other  hand,  the 
form  of  the  interpolation  function  is  an  a  priori  assumption  about  the  lens 
distortions,  and  may  or  may  not  be  appropriate. 

2.1.1  Linear  Interpolation 

Let $p_{ij} = [\,x \;\; y \;\; z\,]^T$ and $q_{ij} = [\,\rho \;\; \gamma \;\; 1\,]^T$. Then a linear
transformation between $p$ and $q$ is given by

$$p_{ij} = A_i \, q_{ij}$$

where $A_i$ is a 3x3 matrix. Given $n$ measurements on each plane, we can
then form the system

$$[\, p_{i1} \;\; p_{i2} \;\; \ldots \;\; p_{in} \,] = A_i \, [\, q_{i1} \;\; q_{i2} \;\; \ldots \;\; q_{in} \,]$$

or,

$$P_i = A_i Q_i$$

This system can be solved in the least squares sense by using the matrix
pseudoinverse (also called the generalized matrix inverse) [1]:

$$A_i = P_i Q_i^T \left[ Q_i Q_i^T \right]^{-1}$$

Given a pixel $v$ in the image, written in homogeneous form as
$[\,\rho \;\; \gamma \;\; 1\,]^T$, the direction of the line-of-sight vector
through $v$ is given by $u_1 - u_2$, where $u_1 = A_1 v$ and $u_2 = A_2 v$.
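The fit and the resulting back-projection can be sketched in plain Python. This is a minimal sketch, not the implementation used in the paper: the names `inv3`, `fit_plane_map`, `apply_map`, and `back_project` are introduced here for illustration, and the fit assumes the column-stacked normal-equations form $A_i = P_i Q_i^T [Q_i Q_i^T]^{-1}$.

```python
def inv3(m):
    """Invert a 3x3 matrix (list of rows) using the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("singular matrix: calibration points are degenerate")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def fit_plane_map(points, pixels):
    """Least-squares 3x3 matrix A with p ~= A q, where q = [rho, gamma, 1]."""
    pq = [[0.0] * 3 for _ in range(3)]  # accumulates P Q^T
    qq = [[0.0] * 3 for _ in range(3)]  # accumulates Q Q^T
    for p, (rho, gamma) in zip(points, pixels):
        q = (rho, gamma, 1.0)
        for r in range(3):
            for c in range(3):
                pq[r][c] += p[r] * q[c]
                qq[r][c] += q[r] * q[c]
    qq_inv = inv3(qq)
    return [[sum(pq[r][k] * qq_inv[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def apply_map(A, pixel):
    """Map a pixel (rho, gamma) to its interpolated 3d point on one plane."""
    q = (pixel[0], pixel[1], 1.0)
    return tuple(sum(A[r][k] * q[k] for k in range(3)) for r in range(3))


def back_project(A1, A2, pixel):
    """Return (u1, direction): a point on the line of sight and u1 - u2."""
    u1 = apply_map(A1, pixel)
    u2 = apply_map(A2, pixel)
    return u1, tuple(a - b for a, b in zip(u1, u2))
```

With `A1` and `A2` fitted once per calibration plane, each call to `back_project` costs only two small matrix-vector products.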

2.1.2  Quadratic  Interpolation 

Quadratic interpolation is similar to linear, except that second-order terms
are used in the parameterization, and the matrix $A$ is 3x6. We represent a
point in space by $p = [\,x \;\; y \;\; z\,]^T$, but we represent image locations by
$q = [\,\rho \;\; \gamma \;\; \rho^2 \;\; \gamma^2 \;\; \rho\gamma \;\; 1\,]^T$. With these modifications, the formulation
is otherwise identical. Martins et al. report that quadratic interpolation was
more accurate than linear.
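The second-order parameterization can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes the coefficient matrix is stored as three rows of six coefficients, one row per spatial coordinate.

```python
def quadratic_features(rho, gamma):
    """Return q = [rho, gamma, rho^2, gamma^2, rho*gamma, 1] for a pixel."""
    return [rho, gamma, rho * rho, gamma * gamma, rho * gamma, 1.0]


def apply_quadratic_map(A, rho, gamma):
    """Map a pixel to a 3d point using a coefficient matrix A (3 rows of 6)."""
    q = quadratic_features(rho, gamma)
    return tuple(sum(a * b for a, b in zip(row, q)) for row in A)
```

The least-squares fit itself has the same form as in the linear case; only the feature vector changes.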

2.2  Local  Interpolation 

With no a priori knowledge about the lens distortions, global interpolation
may be inappropriate. A better approach may be to model the distortions
locally.  If  the  calibration  data  is  dense  enough,  the  interpolation  can  be 
very  accurate.  In  the  paragraphs  below,  we  discuss  a  technique  called  linear 
spline  interpolation,  which  uses  a  linear  function  to  perform  interpolation 
over  each  local  region. 

Conceptually, this technique of interpolation consists of tessellating each
calibration grid with triangles, and performing linear interpolation within
each  triangle.  The  calibration  points  form  the  vertices  of  the  triangles.  A 
plane  is  defined  uniquely  by  three  points,  so  no  errors  are  introduced  at  the 
vertices.  This  is  not  the  case  for  global  interpolation  techniques,  in  which 
errors  are  averaged  over  all  points,  including  calibration  points.  Martins,  et 
al  achieved  their  best  accuracy  using  this  form  of  interpolation.  In  section 
4,  we  report  experiments  which  confirm  this  result. 

In our current implementation, the grid is not tessellated in advance.
Instead, for any point $v$ in the image, each calibration grid is searched to
find the three closest calibration points. The linear interpolation matrices $A_i$
are computed using just three points each. The line-of-sight vector is then
computed as in Section 2.1.1. Figure 3 illustrates the procedure.

Figure 3: Linear Spline Interpolation
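The three-nearest-points search and the interpolation on one calibration plane can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, and the interpolation is written with barycentric weights, which is equivalent to applying the affine map defined by three point correspondences. On a regular grid the three nearest points can be collinear; the sketch simply raises an error in that case.

```python
def barycentric_weights(v, a, b, c):
    """Weights (wa, wb, wc) with v = wa*a + wb*b + wc*c and wa + wb + wc = 1."""
    det = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    if abs(det) < 1e-12:
        raise ValueError("the three nearest calibration points are collinear")
    wb = ((v[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (v[1] - a[1])) / det
    wc = ((b[0] - a[0]) * (v[1] - a[1]) - (v[0] - a[0]) * (b[1] - a[1])) / det
    return 1.0 - wb - wc, wb, wc


def interpolate_local(v, pixels, points):
    """Interpolate the 3d location of image point v on one calibration plane.

    pixels[j] is the image location of calibration point points[j].
    """
    # Rank calibration points by squared image distance to v; keep three.
    order = sorted(
        range(len(pixels)),
        key=lambda j: (pixels[j][0] - v[0]) ** 2 + (pixels[j][1] - v[1]) ** 2,
    )
    i, j, k = order[:3]
    wa, wb, wc = barycentric_weights(v, pixels[i], pixels[j], pixels[k])
    return tuple(
        wa * points[i][n] + wb * points[j][n] + wc * points[k][n]
        for n in range(3)
    )
```

Calling `interpolate_local` once per calibration plane gives the two points that define the line of sight. At a calibration point the weights reduce to (1, 0, 0), so the measured 3d location is reproduced exactly.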


3  Linear  Solution  to  the  Projection  Problem 

The  two-plane  method  for  solution  to  the  back-projection  problem  did  not 
utilize  an  explicit  camera  model.  In  order  to  use  the  two-plane  data  to  obtain 
a  solution  to  the  projection  problem,  it  is  necessary  to  have  a  camera  model 
to formulate the equations. We use a model similar to that of Yakimovsky
and Cunningham [9].

In  figure  4  we  begin  with  a  pinhole  model  and  define  the  following 
vectors  and  points: 

• $P = [\,p_x \;\; p_y \;\; p_z\,]^T$ = the vector from the origin to a point in space.

• $F = [\,f_x \;\; f_y \;\; f_z\,]^T$ = the vector from the origin to the camera focal
point.

• $R = [\,r_x \;\; r_y \;\; r_z\,]^T$ = a vector that points along the direction of
increasing row number. $R$ represents the displacement vector from
one pixel to the next in the row direction. The magnitude of $R$ is the
row scale factor.

• $C = [\,c_x \;\; c_y \;\; c_z\,]^T$ = a vector that points along the direction of
increasing column number. $C$ represents the displacement vector from
one pixel to the next in the column direction. The magnitude of $C$ is
the column scale factor.

• $[\,\rho_p \;\; \gamma_p\,]^T$ = the piercing point of the image, or the point where
the optical axis pierces the image plane.

The vectors $R$ and $C$ define the orientation and scale of the image plane.
Columns in the image plane are parallel to $R$, while rows are parallel to $C$.

The projection, $[\,\rho \;\; \gamma\,]^T$, of a point $P$ onto the image plane can be
computed by taking the dot products of the vector from the focal point to $P$
with $R$ and $C$, and adding the offset to the piercing point. For example,
consider computing the row coordinate, $\rho$, of the projection. The vector
$P - F$ is the vector from the focal point that passes through $P$. Every point
along this vector will have the same location in the image. Let $V$ be a
normalized vector along $P - F$, that is, let

$$V = \frac{P - F}{\| P - F \|} \qquad (1)$$

where $\| \cdot \|$ denotes the length of a vector. Let $\odot$ represent the usual vector
dot product. Then $V \odot R$ represents the projection of $V$ onto $R$ measured in
row units. Adding the row coordinate of the piercing point translates $V \odot R$
into image coordinates.

Therefore, the image location, $[\,\rho \;\; \gamma\,]^T$, of a point $P$ can be computed
using the equations:

$$\rho = V \odot R + \rho_p \qquad (2)$$

$$\gamma = V \odot C + \gamma_p \qquad (3)$$

If $F$, $R$, $C$, $\rho_p$, and $\gamma_p$ are all unknown, then the resulting system is
non-linear. However, the two-plane formulation of the back-projection problem
yields the information needed to make solving for the focal point location
a linear problem.
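Once the model parameters are known, equations (1) through (3) amount to a few lines of arithmetic. The following Python sketch is illustrative only; the function name `project` and its argument order are assumptions made here.

```python
import math


def project(P, F, R, C, rho_p, gamma_p):
    """Project a 3d point P into the image using equations (1)-(3).

    F is the focal point, R and C are the row and column base vectors,
    and (rho_p, gamma_p) is the piercing point.
    """
    d = [P[k] - F[k] for k in range(3)]
    length = math.sqrt(sum(x * x for x in d))
    if length == 0.0:
        raise ValueError("P coincides with the focal point")
    V = [x / length for x in d]                            # equation (1)
    rho = sum(v * r for v, r in zip(V, R)) + rho_p         # equation (2)
    gamma = sum(v * c for v, c in zip(V, C)) + gamma_p     # equation (3)
    return rho, gamma
```

Because $V$ is normalized, every point along the same line of sight projects to the same image location, as stated in the text.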


Figure  4:  A  Linear  Model  of  Camera  Geometry 

3.1  Focal  Point  Solution 

In  a  pinhole  camera,  all  incoming  light  rays  pass  through  the  focal  point. 
Since  a  lens  is  not  a  perfect  pinhole,  we  instead  refer  to  an  effective  focal 
point,  which  is  the  point  that  is  closest  to  all  the  rays.  From  the  two-plane 
method  for  the  back-projection  problem,  one  can  compute  a  bundle  of  rays 
that  pass  through  the  lens.  The  next  step  is  to  find  the  point  in  space  that 
minimizes  the  distance  to  all  the  rays. 

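The equation for the squared distance, $d^2$, from a point $P = [\,x \;\; y \;\; z\,]^T$
to the line through $P_1 = [\,x_1 \;\; y_1 \;\; z_1\,]^T$ in direction $[\,a \;\; b \;\; c\,]^T$ (where
$a^2 + b^2 + c^2 = 1$) is:

$$d^2 = \begin{vmatrix} y - y_1 & z - z_1 \\ b & c \end{vmatrix}^2 + \begin{vmatrix} z - z_1 & x - x_1 \\ c & a \end{vmatrix}^2 + \begin{vmatrix} x - x_1 & y - y_1 \\ a & b \end{vmatrix}^2$$

Expansion of terms yields:

$$\begin{aligned}
d^2 = {} & x^2(b^2 + c^2) + y^2(a^2 + c^2) + z^2(a^2 + b^2) \\
& - 2xyab - 2xzac - 2yzbc \\
& + 2x(bk_3 - ck_2) + 2y(ck_1 - ak_3) + 2z(ak_2 - bk_1) \\
& + k_1^2 + k_2^2 + k_3^2
\end{aligned}$$

where

$$k_1 = z_1 b - y_1 c, \qquad k_2 = x_1 c - z_1 a, \qquad k_3 = y_1 a - x_1 b$$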

To find the effective focal point, we need to minimize $D = \sum d^2$.
Differentiating $D$ with respect to $x$, $y$, and $z$ yields:

$$\partial D / \partial x = \sum 2x(b^2 + c^2) - \sum 2yab - \sum 2zac + \sum 2(bk_3 - ck_2)$$

$$\partial D / \partial y = \sum 2y(a^2 + c^2) - \sum 2xab - \sum 2zbc + \sum 2(ck_1 - ak_3)$$

$$\partial D / \partial z = \sum 2z(a^2 + b^2) - \sum 2xac - \sum 2ybc + \sum 2(ak_2 - bk_1)$$

The sums are taken over all the line-of-sight vectors ($a$, $b$, $c$, $k_1$, $k_2$, $k_3$ are
functions of the vectors).

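Now, by setting the derivatives of $D$ to zero to find the minimum, and
putting the equations in matrix form, we obtain:

$$h = A f$$

where $f = [\,x \;\; y \;\; z\,]^T$ and, collecting terms from the derivatives,

$$A = \begin{bmatrix} \sum (b^2 + c^2) & -\sum ab & -\sum ac \\ -\sum ab & \sum (a^2 + c^2) & -\sum bc \\ -\sum ac & -\sum bc & \sum (a^2 + b^2) \end{bmatrix}, \qquad h = \begin{bmatrix} \sum (ck_2 - bk_3) \\ \sum (ak_3 - ck_1) \\ \sum (bk_1 - ak_2) \end{bmatrix}$$

So the solution we seek, $f$, the effective focal point of the camera, is
simply:

$$f = A^{-1} h$$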
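The focal point computation can be sketched in plain Python. This is a sketch under stated assumptions rather than the paper's implementation: each ray is given as a point on the ray plus a direction (normalized inside the function), the entries of $A$ and $h$ are accumulated from the zeroed derivatives of $D$, and the 3x3 system is solved by Cramer's rule. The function names are hypothetical.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(A, h):
    """Solve A f = h for a 3x3 system using Cramer's rule."""
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("rays are parallel: focal point is not determined")
    f = []
    for col in range(3):
        M = [row[:] for row in A]
        for r in range(3):
            M[r][col] = h[r]
        f.append(det3(M) / d)
    return f


def effective_focal_point(rays):
    """Point minimizing the summed squared distance to all rays.

    rays is a sequence of ((x1, y1, z1), direction) pairs.
    """
    A = [[0.0] * 3 for _ in range(3)]
    h = [0.0] * 3
    for (x1, y1, z1), direction in rays:
        n = math.sqrt(sum(t * t for t in direction))
        a, b, c = (t / n for t in direction)
        k1 = z1 * b - y1 * c
        k2 = x1 * c - z1 * a
        k3 = y1 * a - x1 * b
        rows = [[b * b + c * c, -a * b, -a * c],
                [-a * b, a * a + c * c, -b * c],
                [-a * c, -b * c, a * a + b * b]]
        rhs = [c * k2 - b * k3, a * k3 - c * k1, b * k1 - a * k2]
        for r in range(3):
            h[r] += rhs[r]
            for s in range(3):
                A[r][s] += rows[r][s]
    return solve3(A, h)
```

When all rays pass exactly through one point, that point is recovered; with real calibration data the result is the least-squares compromise described in the text.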

3.2  Computation  of  the  Camera  Base  Vectors 

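Equations (2) and (3) relate the position of a point in space to a
corresponding image location. These equations can be written as a linear system:

$$[\,\rho \;\; \gamma\,] = [\,v_x \;\; v_y \;\; v_z \;\; 1.0\,] \begin{bmatrix} r_x & c_x \\ r_y & c_y \\ r_z & c_z \\ \rho_p & \gamma_p \end{bmatrix}$$

Given $N$ points in space, we have $2N$ equations to solve for the 8
unknowns in $R$, $C$, $\rho_p$, and $\gamma_p$:

$$\begin{bmatrix} \rho_1 & \gamma_1 \\ \vdots & \vdots \\ \rho_N & \gamma_N \end{bmatrix} = \begin{bmatrix} v_x^1 & v_y^1 & v_z^1 & 1.0 \\ \vdots & \vdots & \vdots & \vdots \\ v_x^N & v_y^N & v_z^N & 1.0 \end{bmatrix} \begin{bmatrix} r_x & c_x \\ r_y & c_y \\ r_z & c_z \\ \rho_p & \gamma_p \end{bmatrix}$$

or,

$$B = W X$$

So, using the pseudoinverse to obtain a least squares solution, we have:

$$X = \left[ W^T W \right]^{-1} W^T B$$

$X$ contains the values for $R$, $C$, $\rho_p$, and $\gamma_p$.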
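The least-squares solution for the base vectors can be sketched as follows, assuming the normalized vectors $V$ have already been computed from the effective focal point. The function names are hypothetical, and the normal equations $W^T W X = W^T B$ are solved by Gauss-Jordan elimination, one of several reasonable choices for a 4x4 system.

```python
def solve_linear(M, rhs):
    """Solve M X = rhs by Gauss-Jordan elimination with partial pivoting.

    M is n x n; rhs is n x m (one or more right-hand-side columns).
    """
    n = len(M)
    aug = [list(M[r]) + list(rhs[r]) for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough independent points")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_base_vectors(unit_vectors, pixels):
    """Least-squares estimate of R, C, rho_p, gamma_p from B = W X.

    unit_vectors[j] is V for point j; pixels[j] is its (rho, gamma).
    """
    W = [[vx, vy, vz, 1.0] for vx, vy, vz in unit_vectors]
    WtW = [[sum(w[i] * w[j] for w in W) for j in range(4)] for i in range(4)]
    WtB = [[sum(w[i] * b[k] for w, b in zip(W, pixels)) for k in range(2)]
           for i in range(4)]
    X = solve_linear(WtW, WtB)
    R = [X[i][0] for i in range(3)]
    C = [X[i][1] for i in range(3)]
    return R, C, X[3][0], X[3][1]
```

At least four points whose unit vectors are not confined to a single plane are needed for $W^T W$ to be invertible; in practice many more are used.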

3.3  The  Local  Projection  Problem 

In  section  2  we  presented  several  ways  of  modeling  camera  geometry  for  the 
back-projection  problem.  The  linear  interpolation  technique  (section  2.1.1), 
which  involved  fitting  a  first-order  transformation  to  all  the  calibration  data, 
is  a  global  modeling  technique.  The  linear  spline  interpolation  technique 
(section  2.2)  is  a  local  modeling  technique,  since  at  each  pixel,  only  the 
calibration  data  in  a  local  region  around  the  pixel  is  used  to  compute  the 
interpolation function. The results of Martins et al. [3] and our laboratory
results (see section 4) both indicate that a local modeling technique yields
superior accuracy.

The solution to the projection problem presented in sections 3.1 and 3.2
is a global modeling technique. All the calibration data is used to
compute the model parameters, and the results are used to solve the projection
problem for any point in space. In direct analogy to the linear spline
technique used in back-projection, a technique can be derived for using local
information to improve the accuracy of the solution to the projection problem.


Our local projection technique involves finding a linear model for local
regions of the image. The technique involves two steps. In the first step,
the global solution is used to obtain an estimated image location for a
point in space. That estimated image location is used to find the four
nearest calibration points on each calibration plane. In the second step,
these points are used to compute a local linear solution to the projection
problem. The local linear solutions could also be precomputed, and the
global solution would then be used simply to index the correct local solution.

4  Experimental  Results 

Measurements and tests were conducted within the Calibrated Imaging
Laboratory (CIL) at CMU (Shafer [4]). The CIL is a facility that provides a
precision imaging capability. The purpose of the CIL is to provide researchers
with accurate knowledge about ground truth so that computer vision theories
can be tested under controlled scientific conditions. Of particular interest for
this study, the CIL provides facilities to accurately measure point locations,
and to accurately position and orient cameras.

Position measurement of points in the CIL is performed with the use
of theodolites (surveyor's transits), which are basically telescopes with
crosshairs for sighting, mounted on accurate pan/tilt mechanisms. Objects
to be measured are placed at one end of an optical bench; the theodolites are
fixed to the other end, separated by a little more than 1 meter. To measure
the position of a point, the crosshairs of each theodolite are placed over the
point, and the horizontal and vertical displacements read off. Trigonometric
equations then yield the position of the point in a Cartesian coordinate
system defined with respect to the theodolites. As currently configured, the
theodolites can determine point locations to within 0.1 mm.

4.1  Test  Scenario 

The  laboratory  tests  described  below  were  designed  to  provide  answers  to 
the  following  questions: 

1. What accuracies can be expected from off the shelf cameras and
lenses?

2. How does increasing the number of calibration points affect the
accuracy of calibration?

3.  What  is  the  expected  accuracy  for  the  projection  problem? 

Tests  were  performed  using  a  calibrated  grid.  The  grid  consisted  of 
horizontal  and  vertical  lines  1mm  in  width,  spaced  12.7  mm  apart.  The 
intersections  of  the  lines  on  the  grid  were  used  as  calibration  points.  A 
special  intersection  detector  was  implemented  to  extract  the  intersections 
from  digital  images  with  sub-pixel  precision.  Each  time  the  grid  was  moved, 
new  measurements  were  taken,  an  image  digitized,  and  the  intersection 
detector  applied.  The  result  was  a  data  file  in  which  each  calibration  point 
was  associated  with  its  3d  position  and  its  image  location. 

A  complete  test  consisted  of  data  from  three  different  grid  locations. 
Due  to  the  size  of  the  laboratory,  focal  length  of  the  lens,  and  depth  of  field 
of  the  lens,  the  grid  was  typically  placed  at  distances  ranging  from  0.50m 
to  0.56m  from  the  camera.  Data  from  two  of  the  grid  locations  was  used  to 
compute  calibration  parameters.  These  parameters  were  then  tested  using 
data  from  the  third  grid  location.  The  third  grid  will  often  be  referred  to 
as  the  test  grid.  In  each  of  the  tests  reported  here,  the  focus  of  the  camera 
was  kept  fixed.  The  camera  used  was  a  Sony  CCD,  model  AVC-D1,  with 
the  standard  16mm  lens. 

A  total  of  300  calibration  points  were  used  on  each  grid.  Rather  than 
measure  the  location  of  each  point  individually,  the  location  of  each  point 
was  computed  based  on  the  measured  locations  of  the  center  point  and  the 
four  corners.  Consequently,  the  accuracy  of  the  data  depended  not  only  on 
the  accuracy  of  the  theodolites,  but  also  upon  factors  such  as  the  planarity 
of the grid, and the precision of the grid lines. In preparing for each test, the
overall accuracy of the calibration data was estimated. For several points
at each grid location, the 3d locations were measured using the theodolites.
The measured locations were then compared with the computed locations.
Differences of up to 0.2 mm were recorded, with typical differences being
between 0.1 and 0.2 mm. The accuracy of the calibration method is limited
by the accuracy of the calibration data, so the best accuracy achievable in
this scenario is between 0.1 and 0.2 mm.

The effect of the density of calibration points on calibration accuracy was
tested by varying the number of calibration points used. This was easily
implemented by simply skipping over some of the rows and columns in the
grid. In each case, the calibration points were uniformly distributed over
the image. Data is reported for the following distributions of points: 3x3,
5x7, 7x10, 15x20.

In all the tests reported below, grid 0 refers to the grid location farthest
from  the  camera,  while  grid  2  refers  to  the  grid  location  closest  to  the 
camera.  In  all  cases,  the  number  of  points  was  varied  to  compute  the 
calibration  parameters,  but  all  300  calibration  points  on  the  test  grid  were 
used  in  testing. 

4.2  Back-Projection  Results 

To  test  the  accuracy  of  the  back-projection  problem,  the  image  location  of 
each  of  the  calibration  points  on  the  third  grid  was  used  to  compute  a  line- 
of-sight  vector.  The  intersection  of  this  vector  with  the  plane  of  the  test 
grid  was  computed,  and  the  distance  between  the  intersection  and  the  actual 
position  was  used  as  the  error  measure.  In  the  results  reported  below,  the 
errors  are  averages  taken  over  all  the  calibration  points. 
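The error measure described above can be sketched in Python. The names and the plane representation (a point on the plane plus a normal vector) are assumptions made for illustration; `math.dist` requires Python 3.8 or later.

```python
import math


def intersect_plane(origin, direction, plane_point, normal):
    """Intersect the line origin + t*direction with a plane."""
    denom = sum(d * n for d, n in zip(direction, normal))
    if abs(denom) < 1e-12:
        raise ValueError("line of sight is parallel to the test plane")
    t = sum((p - o) * n
            for p, o, n in zip(plane_point, origin, normal)) / denom
    return tuple(o + t * d for o, d in zip(origin, direction))


def mean_back_projection_error(rays, true_points, plane_point, normal):
    """Average distance between ray/plane intersections and true positions.

    rays[j] is an (origin, direction) pair for calibration point j, and
    true_points[j] is its measured 3d position on the test grid.
    """
    total = 0.0
    for (origin, direction), truth in zip(rays, true_points):
        hit = intersect_plane(origin, direction, plane_point, normal)
        total += math.dist(hit, truth)
    return total / len(true_points)
```

With positions expressed in millimetres, the returned value is directly comparable to the error columns of the back-projection table.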

Table  1  presents  the  results  obtained  for  the  back-projection  problem. 
The  first  error  column  contains  the  results  for  global  linear  interpolation.  For 
this  method,  the  density  of  the  calibration  grid  makes  little  or  no  difference 
to  the  accuracy  of  the  result.  This  was  expected;  since  the  calibration  points 
are  uniformly  distributed  across  the  grid,  additional  points  do  not  provide 
additional  information  for  a  linear  fit.  The  best  accuracy  is  achieved  when 
the  test  grid  is  positioned  between  the  other  two  grids  used  for  calibration. 


array    calibration   test    error (mm)
size     grids         grid    global    local
-----    -----------   ----    ------    -----
3x3      0, 1          2       1.921     0.731
3x3      0, 2          1       0.388     0.296
3x3      1, 2          0       0.561     0.317
5x7      0, 1          2       1.854     0.740
5x7      0, 2          1       0.366     0.166
5x7      1, 2          0       0.551     0.235
7x10     0, 1          2       1.810     0.696
7x10     0, 2          1       0.350     0.169
7x10     1, 2          0       0.509     0.185
15x20    0, 1          2       1.813     0.666
15x20    0, 2          1       0.366     0.147
15x20    1, 2          0       0.534     0.201

Table 1: Calibration Accuracy of the Back-Projection Problem

The second error column in table 1 presents the results obtained for the
back-projection problem using local linear spline interpolation. This time there
is  a  general  trend  for  greater  accuracy  with  more  calibration  points.  This 
reflects  the  fact  that  the  linear  spline  method  interpolates  over  local  regions, 
and  can  more  accurately  approximate  effects  such  as  barrel  distortion.  There 
are  instances  observable  in  the  table  which  seem  to  contradict  the  general 
trend;  these  are  most  likely  due  to  noise  in  the  measurements  or  in  the 
process  of  point  extraction.  Over  a  number  of  trials,  the  general  trend  has 
been  consistent. 

The  results  in  table  1  agree  with  the  results  obtained  by  Martins,  et  al. 
To  summarize,  the  local  linear  spline  interpolation  procedure,  with  as  few 
as  12  calibration  points,  is  more  accurate  than  global  linear  interpolation. 
In  addition,  the  use  of  more  calibration  points  improves  the  accuracy  of  the 
local  linear  spline  method.  The  accuracies  we  achieved  in  our  tests  were  at 
the  level  of  the  accuracies  of  our  measurements. 

4.2.1  Projection  Results 

The accuracy of the projection problem was tested with a procedure
similar to that used in the back-projection problem. The 3d location of each
calibration  point  on  the  test  grid  was  projected  into  the  image  plane,  and 
the  difference  (in  pixels)  between  the  projected  location  and  the  measured 
location  was  used  as  the  error  measure. 


The test results for the projection problem are reported below in table
2. The first error column gives the error, in pixels, obtained with the
global solution. The average error reported in all cases was less than
two pixels, which is good enough for many applications. The results indicate
that  the  standard  lenses  for  our  cameras  are  reasonably  good,  and  can  be 
approximated  well  with  a  pinhole  model. 

The  second  error  column  in  table  2  reports  the  errors  recorded  using  the 
local  solution  to  the  projection  problem. 

A  comparison  of  the  two  columns  in  table  2  shows  an  improvement 
resulting  from  using  local  information.  In  general,  the  results  from  the 
local  solution  to  the  projection  problem  are  a  factor  of  2  improved  over  the 
global  solution.  The  entries  followed  by  a  *  are  examples  where  the  global 
result  was  better  than  the  local  result;  this  may  be  an  effect  of  errors  in 
the  measurement  process.  The  general  conclusion  that  can  be  drawn  is  that 
local  models  of  camera  geometry  provide  more  accurate  results  than  global 
models — for  simple  interpolation  functions. 


array    calibration   test    error (pixels)
size     grids         grid    global    local
-----    -----------   ----    ------    -----
3x3      0, 1          2       1.58      1.70*
3x3      0, 2          1       0.89      0.65
3x3      1, 2          0       0.89      0.82
5x7      0, 1          2       1.32      1.29
5x7      0, 2          1       0.95      0.36
5x7      1, 2          0       0.98      0.46
7x10     0, 1          2       1.20      1.01
7x10     0, 2          1       0.91      0.34
7x10     1, 2          0       0.88      0.40
15x20    0, 1          2       1.15      1.38*
15x20    0, 2          1       0.92      0.92
15x20    1, 2          0       0.91      0.36

Table 2: Accuracy of the Projection Problem

4.2.2 Conclusions

In  section  4.1,  we  enumerated  three  questions  which  were  to  be  answered  by 
the  tests  reported  above.  We  now  proceed  to  answer  each  of  these  questions 
in  turn. 

1.  What  accuracies  can  be  expected  from  off  the  shelf  cameras  and 
lenses? 

Tables  1  and  2  of  test  results  show  the  accuracy  achievable  with  a 
standard  commercial  CCD,  using  the  standard  lens  supplied  with  the 
camera.  With  a  simple  global  interpolation  scheme,  accuracies  as 
good  as  1  part  in  1400  (0.3  mm  over  530  mm)  can  be  obtained.  With 
a  more  sophisticated  local  linear  spline  interpolation,  the  accuracies 
can  be  increased  to  1  part  in  3500. 

2.  How  does  increasing  the  number  of  calibration  points  affect  the  ac¬ 
curacy  of  calibration? 

We have shown that the maximum accuracy for global linear
interpolation can be achieved with a small number of calibration points,
provided that the points are uniformly distributed over the image.
Further increasing the number of calibration points has no effect on the
accuracy. With a local linear spline interpolation, adding calibration
points clearly improves the accuracy of the back-projection problem,
until the limiting accuracy of the calibration data is reached.

3.  What  is  the  expected  accuracy  for  the  projection  problem? 

Using  either  a  local  or  global  solution,  the  projection  problem  can 
be  solved  to  within  two  pixels;  results  as  good  as  0.34  pixels  were 
reported.  For  many  applications,  solution  of  the  projection  problem 
need  not  be  extremely  accurate.  In  many  instances,  the  projected  pixel 
location  is  only  needed  to  find  the  center  of  a  region  within  which  an 
operation  will  be  performed.  For  these  applications,  accuracy  of  one 
to  two  pixels  is  adequate. 


It is important to note that the local interpolation outperformed the
global interpolation. While the differences were not great in our
tests, the lenses we used were fairly linear. If extremely wide angle
lenses are used, the distortions may be large, and the ability to locally
interpolate will be much more important.

5  Discussion 

We  have  presented  a  calibration  method  that  we  believe  meets  many  of  the 
requirements  of  a  basic  calibration  technique  that  can  be  used  for  a  variety 
of applications. Our method is based on the two-plane method of Martins,
Birk, and Kelley [3], but is extended to include a solution to the projection
problem.  We  believe  that  the  method  presented  here  has  many  advantages, 
described  in  the  following  paragraphs: 

•  Completeness. 

The  original  two-plane  method  of  calibration  only  provided  a  solution 
to  the  back-projection  problem.  While  this  is  sufficient  for  many 
applications,  a  solution  to  the  projection  problem  is  also  necessary 
for applications such as mobile robots. We have extended the
two-plane method by providing a solution to the projection problem.

•  Accuracy. 

The  two-plane  calibration  method  can  be  made  arbitrarily  accurate. 
As  reported  in  section  4,  increasing  the  number  of  calibration  points 
results  in  increasing  accuracy.  If  no  improvement  results  from  adding 
more  points,  then  the  accuracy  of  the  calibration  data  must  be  im¬ 
proved. 

The projection problem exhibits much of the same behavior as the
back-projection  problem.  Of  particular  interest  is  the  observation 
that  local  modeling  of  camera  geometry  improves  the  accuracy  of 
the  projection  problem,  as  well  as  the  back-projection  problem.  The 
accuracies  observed  in  our  tests  were  typically  less  than  one  pixel. 

•  Simplicity.

The two-plane model is conceptually very straightforward and easy
to implement. The use of the line-of-sight vectors to solve for the
parameters of a linear camera model arises intuitively from the
geometry of the model. The method of solution for the camera model
involves solving only linear equations, so no sophisticated
optimization techniques are involved.

•  Efficiency. 

Solution of either the back-projection or projection problem requires
only a few matrix multiplies and matrix inversions on small matrices.
The operations are guaranteed to produce a unique answer within a
fixed time. While a relatively large amount of data must be stored for
this calibration method compared to other methods, the total amount
is still insignificant.

•  Practicality. 

Because  the  method  provides  a  complete  solution  to  the  geometric 
calibration  problem,  the  method  can  be  used  for  any  application.  The 
accuracy  can  be  arbitrarily  increased  (or  decreased)  to  meet  the  re¬ 
quirements  for  a  given  application.  The  only  change  in  the  method 
is  to  store  the  data  from  more  calibration  points.  The  mathematics 
remains  the  same,  and  no  special  equipment  is  required  beyond  that 
needed  to  obtain  precise  locations  for  the  calibration  points. 

In  addition  to  the  benefits  of  the  method  we  presented,  some  general 
observations  should  be  made. 

•  Without the benefit of a priori knowledge of the form of lens
distortions, local modeling of distortions seems to perform better than
global modeling. A global model is an attempt to fit the data into
a predetermined form and average the error across the entire image.
The accuracy of the interpolation is limited by how well the chosen
model reflects reality. Conceivably, a different function could be
required for different types of lenses to reflect different models of
distortion.


Modeling  distortion  locally  makes  no  assumption  about  the  forms  of 
the  lens  distortions.  The  local  model  can  be  made  arbitrarily  accurate 
by  simply  sampling  at  more  pixels.  We  have  shown  that  relatively 
few  points  are  needed  to  achieve  the  level  of  accuracy  that  the  mea¬ 
surement  devices  provide.  Moreover,  local  modeling  is  more  accurate 
for  solving  both  the  projection  and  back-projection  problems. 

•  Nearly  all  of  the  data  reported  showed  that  the  best  accuracy  was 
obtained  when  the  test  grid  was  placed  between  the  two  grids  used 
for  calibration.  This  is  a  specific  instance  of  the  general  fact  that 
interpolation  is  more  accurate  than  extrapolation.  In  calibrating  a  real 
robotic  system,  the  calibration  data  should  ideally  be  obtained  so  as 
to  bound  the  region  of  interest  as  much  as  possible. 
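
The contrast between global and local modeling can be illustrated with a
small one-dimensional sketch. It is not the calibration procedure used in
this paper: the cubic distortion term, the function names, and the
sample counts are all assumptions chosen for illustration. A single
straight line (the global model) is compared with piecewise-linear
interpolation between calibration points (the local model).

```python
import numpy as np

def true_mapping(x, k=0.05):
    """Hypothetical lens mapping with a cubic distortion term (illustrative only)."""
    return x + k * x**3

def max_errors(n_cal, x_test):
    """Return (global_error, local_error) for n_cal uniformly spaced calibration points."""
    x_cal = np.linspace(-1.0, 1.0, n_cal)
    y_cal = true_mapping(x_cal)
    # Global model: one straight line fitted to all calibration points.
    slope, intercept = np.polyfit(x_cal, y_cal, 1)
    y_true = true_mapping(x_test)
    global_err = np.max(np.abs(slope * x_test + intercept - y_true))
    # Local model: linear interpolation between neighbouring calibration points.
    local_err = np.max(np.abs(np.interp(x_test, x_cal, y_cal) - y_true))
    return global_err, local_err

# Test points lie inside the calibrated interval (interpolation, not extrapolation).
x_test = np.linspace(-0.95, 0.95, 201)
g9, l9 = max_errors(9, x_test)
g33, l33 = max_errors(33, x_test)
print("9 points:  global %.5f  local %.5f" % (g9, l9))
print("33 points: global %.5f  local %.5f" % (g33, l33))
```

In this toy setting the global error barely changes when calibration
points are added, while the local error keeps shrinking, which mirrors
the qualitative behavior described above.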

Geometric camera calibration may depend on a variety of factors. For
example, the focal distance, the aperture setting, presence or absence of a
filter, or even the operating temperature of a camera may all affect the
calibration parameters. We are currently making measurements and conducting
tests to determine the sensitivity of calibration parameters to many of these
factors.

The  results  reported  in  this  paper  were  obtained  using  data  from  the 
Calibrated  Imaging  Laboratory  at  CMU.  Our  next  application  of  this  method 
will  be  to  calibrate  and  register  three  cameras  and  a  laser  range  finder 
mounted  on  the  CMU  Navlab. 
Abstract: Before corresponding points in images taken with
two cameras can be used to recover distances to objects in a
scene, one has to determine the position and orientation of one
camera relative to the other. This is the classic photogrammetric
problem of relative orientation, central to the interpretation
of binocular stereo information. Iterative methods for determining
relative orientation were developed long ago; without them
we would not have most of the topographic maps we do today.
Relative orientation is also of importance in the recovery of
motion and shape from an image sequence when successive frames
are widely separated in time.

Described  here  is  a  particularly  simple  iterative  scheme  for 
recovering  relative  orientation  that,  unlike  existing  methods, 
does  not  require  a  good  initial  guess  for  the  baseline  and  the 
rotation.  The  data  required  is  a  set  of  pairs  of  corresponding 
rays  from  the  two  projection  centers  to  points  in  the  scene.  It 
is  well  known  that  at  least  five  pairs  of  rays  are  needed.  Less 
appears  to  be  known  about  the  existence  of  multiple  solutions 
and  their  interpretation.  These  issues  are  discussed  here  in  de¬ 
tail.  The  unambiguous  determination  of  all  of  the  parameters 
of  relative  orientation  is  not  possible  when  the  observed  points 
lie  on  a  critical  surface. 

1.  Introduction 

The positions of corresponding points in two images can
be used to determine the positions of points in the environment,
provided that the position and orientation of one
camera with respect to the other is known. Given the internal
geometry of the camera, including its focal length and
the location of the principal point, rays can be constructed
by connecting the points in the images to their corresponding
projection centers. These rays, when extended, intersect
at the point in the scene that gave rise to the image
points. This is how binocular stereo data is used to determine
the positions of points in the environment after the
correspondence problem has been solved.

It is also the method used in motion vision when feature
points are tracked and the image displacements that
occur in the time between two successive frames are relatively
large (see for example [Ullman 1979] and [Tsai &
Huang 1984]). The connection between these two problems
has not attracted much attention before, nor has the relationship
of motion vision to some aspects of photogrammetry
(but see [Longuet-Higgins 1981]). It turns out, for example,
that the well known motion field equations [Longuet-Higgins
& Prazdny 1980, Bruss & Horn 1983] are just the
parallax equations of photogrammetry [Hallert 1960, Moffit
& Mikhail 1980] that occur in the incremental adjustment
of relative orientation. Most papers on relative orientation
only give the equation for y-parallax, corresponding to the
equation for the y-component of the motion field (see for
example the first equation in [Gill 1964], equation (1) in
[Jochmann 1965], and equation (6) in [Oswal 1967]). Some
papers actually give equations for both x- and y-parallax
(see for example equation (9) in [Bender 1967]).

In both binocular stereo and large displacement motion
vision analysis, it is necessary to first determine the relative
orientation of one camera with respect to the other. The
relative orientation can be found if a sufficiently large set
of pairs of corresponding image points have been identified
[Thompson 1959b, Thompson 1968, Ghosh 1972, Schwidefsky
1973, Slama et al. 1980, Moffit & Mikhail 1980, Wolf
1983, Horn 1986].

Let  us  use  the  terms  left  and  right  to  distinguish  the 
two  cameras  (in  the  case  of  the  application  to  motion  vision 
these  will  be  the  camera  positions  and  orientations  corre¬ 
sponding  to  the  earlier  and  the  later  frames  respectively).1 


1In what follows we use the coordinate system of the right (or
later) camera as the reference. One can simply interchange
left and right if it happens to be more convenient to use the
coordinate system of the left (or earlier) camera.
The ray from the center of projection of the left camera to
the  center  of  projection  of  the  right  camera  is  called  the 
baseline  (see  Fig.  1).  A  coordinate  system  can  be  erected 
at  each  projection  center,  with  one  axis  along  the  optical 
axis,  that  is,  perpendicular  to  the  image  plane.  The  other 
two  axes  are  in  each  case  parallel  to  two  convenient  orthog¬ 
onal  directions  in  the  image  plane  (such  as  the  edges  of  the 
image,  or  lines  connecting  pairs  of  fiducial  marks).  The  ro¬ 
tation  of  the  left  camera  coordinate  system  with  respect  to 
the  right  is  called  the  orientation. 

Note  that  we  cannot  determine  the  length  of  the  base¬ 
line  without  knowledge  about  the  length  of  a  line  in  the 
scene,  since  the  ray  directions  are  unchanged  if  we  scale  all 
of  the  distances  in  the  scene  and  the  baseline  by  the  same 
positive  scale-factor.  This  means  that  we  should  treat  the 
baseline  as  a  unit  vector,  and  that  there  are  really  only  five 
unknowns — three  for  the  rotation  and  two  for  the  direction 
of  the  baseline.2 
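
The scale ambiguity can be checked numerically. The following is a
minimal sketch under stated assumptions: the right camera sits at the
origin of the reference frame, the left projection center is at -b in
that frame, and the rotation between the cameras is ignored because it
does not affect the argument. The function names and the sample scene
are hypothetical.

```python
import numpy as np

def unit(v):
    """Normalize each row vector to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def ray_directions(points, b):
    """Unit ray directions to the scene points, expressed in the right camera frame.

    The right projection center is at the origin and the left one at -b,
    so the left ray to a point P is proportional to P + b.
    """
    return unit(points + b), unit(points)

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(6, 3)) + np.array([0.0, 0.0, 5.0])
b = np.array([1.0, 0.1, 0.0])

rl1, rr1 = ray_directions(points, b)
# Scale the whole scene and the baseline by the same positive factor.
rl2, rr2 = ray_directions(3.7 * points, 3.7 * b)
same = np.allclose(rl1, rl2) and np.allclose(rr1, rr2)
print("ray directions unchanged under scaling:", same)
```

Because the directions are identical, no measurement based on ray
directions alone can recover the length of the baseline.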


2.  Existing  Solution  Methods 


Various  empirical  procedures  have  been  devised  for  deter¬ 
mining  the  relative  orientation  in  an  analog  fashion.  Most 
commonly  used  are  stereoplotters,  optical  devices  that  per¬ 
mit  viewing  of  image  pairs  and  superimposed  synthetic  fea¬ 
tures  called  floating  marks.  Differences  in  ray  direction  par¬ 
allel  to  the  baseline  are  called  horizontal  disparities  (or  x- 
parallaxes),  while  differences  in  ray  direction  orthogonal  to 
the  baseline  are  called  vertical  disparities  (or  y-parallaxes).3 
Horizontal  disparities  encode  distances  to  points  on  the  sur¬ 
face  and  are  the  quantities  sought  after  in  measurement  of 
the  underlying  topography.  There  should  be  no  vertical  dis¬ 
parities  when  the  device  is  adjusted  to  the  correct  relative 
orientation,  since  the  rays  from  the  left  and  right  projection 
center  must  lie  in  a  plane  that  contains  the  baseline  if  they 
are  to  intersect. 


The methods used in practice to determine the correct
relative orientation depend on successive adjustments
to eliminate the vertical disparity at each of five or six carefully
chosen points [Sailor 1960, Thompson 1964, Slama et
al. 1980, Moffit & Mikhail 1980, Wolf 1974]. In each of
the adjustments a single parameter of the relative orientation
is varied in order to remove the vertical disparity at
one of the points. Which adjustment is made to eliminate
the vertical disparity at a particular point depends on the

2If  we  treat  the  baseline  as  a  unit  vector,  its  actual  length 
becomes  the  unit  of  length  for  all  other  quantities. 


3This naming convention stems from the observation that,
roughly speaking, in the usual viewing arrangement, horizontal
disparities correspond to left-right displacements in
the image, whereas vertical disparities correspond to up-down
displacements.
Figure 1. Points in the environment are viewed from two camera
positions.  The  relative  orientation  is  the  direction  of  the  base¬ 
line  b,  and  the  rotation  R  relating  the  left  and  right  coordinate 
systems.  The  directions  of  rays  to  at  least  five  scene  points  must 
be  known  in  both  camera  coordinate  systems. 


particular method chosen. In each case, however, one
of the adjustments, rather than being guided visually, is
made by an amount that is calculated, using the measured
values of earlier adjustments. The calculation is based on
the assumptions that the surface being viewed can be approximated
by a plane, that the baseline is roughly parallel
to this plane, and that the optical axes of the two cameras
are roughly perpendicular to this plane. The whole process
is iterative in nature, since the reduction of vertical disparity
at one point by means of an adjustment of a single
parameter of the relative orientation disturbs the vertical
disparity at the other points. Convergence is usually rapid
if a good initial guess is available. It can be slow, however,
when the assumptions on which the calculation is based
are violated, such as in "accidented" or hilly terrain [Van
Der Weele 1959-60]. These methods typically use Euler angles
to represent three-dimensional rotations [Korn & Korn
1968] (traditionally denoted by the Greek letters κ, φ, and
ω). Euler angles have a number of shortcomings for describing
rotations that become particularly noticeable when
these angles become large.


There also exist related digital procedures that converge
rapidly when a good initial guess of the relative orientation
is available, as is usually the case when one is interpreting
aerial photography [Slama et al. 1980]. Not all
of these methods use Euler angles. Thompson [1959b], for
example, uses twice the Gibbs vector [Korn & Korn 1968]
to represent rotations. These procedures may fail to converge
to the correct solution when the initial guess is far off
the mark. In the application to motion vision, approximate
translational and rotational components of the motion are
often not known initially, so a procedure that depends on
good initial guesses is not particularly useful.

Normally,  the  directions  of  the  rays  are  obtained  from 
points  in  images  generated  by  projection  onto  a  planar  sur¬ 
face.  In  this  case  the  directions  are  confined  to  the  field  of 
view  as  determined  by  the  active  area  of  the  image  plane 
and  the  distance  to  the  center  of  projection.  The  field  of 
view  is  always  less  than  a  hemi-sphere,  since  only  points 
in  front  of  the  camera  can  be  imaged.4  The  method  de¬ 
scribed  here  applies,  however,  no  matter  how  the  directions 
to  points  in  the  scene  are  determined.  There  is  no  restric¬ 
tion  on  the  possible  directions  of  the  rays.  We  do  assume, 
however,  that  we  can  tell  which  of  two  opposite  semi-infinite 
line-segments the point lies on. If a point lies on the correct
line-segment  we  will  say  that  it  lies  in  front  of  the  camera, 
otherwise  it  will  be  considered  to  be  behind  the  camera. 

The problem of relative orientation is generally considered
solved, and so has received little attention in the
photogrammetric literature in recent times [Van Der Weele
1959-60]. In the annual index of Photogrammetric Engineering,
for example, there is only one reference to the subject
in the last ten years [Ghilani 1983] and six in the decade
before that. This is very little in comparison to the large
number of papers on this subject in the fifties, as well as
the sixties, including [Gill 1964], [Sailor 1965], [Jochmann
1965], [Ghosh 1966], [Forrest 1966] and [Oswal 1967].

In  this  paper  we  discuss  the  relationship  of  relative 
orientation  to  the  problem  of  motion  vision  in  the  situa¬ 
tion  where  the  motion  between  the  exposure  of  successive 
frames  is  relatively  large.  Also,  a  new  iterative  algorithm 
is  described  here,  as  well  as  a  way  of  dealing  with  the  sit¬ 
uation  when  there  is  no  initial  guess  available  for  the  ro¬ 
tation  or  the  direction  of  the  baseline.  The  advantages  of 
the  unit  quaternion  notation  for  representing  rotations  are 
illustrated  as  well.  Finally,  we  discuss  critical  surfaces,  sur¬ 
face  shapes  that  lead  to  difficulties  in  establishing  a  unique 
relative  orientation. 

3.  Coplanarity  Condition 

If the ray from the left camera and the corresponding ray
from the right camera are to intersect, they must lie in
a plane that also contains the baseline. Thus, if $\mathbf{b}$ is the
vector representing the baseline, $\mathbf{r}_r$ the ray from the right
projection center to the point in the scene and $\mathbf{r}_l$ the ray
from the left projection center to the point in the scene,
then the triple product

$$[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]$$

equals zero, where $\mathbf{r}'_l = \mathrm{Rot}(\mathbf{r}_l)$ is the left ray rotated into
the right camera's coordinate system.5 This is the coplanarity
condition (see Fig. 2).
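
The coplanarity condition can be checked on synthetic data. The sketch
below is illustrative only: the scene points, baseline, and rotation are
made-up values, and the helper names are hypothetical. It assumes the
right camera frame is the reference, the left projection center is at -b
in that frame, and R is the matrix that rotates left-camera rays into
the right camera frame.

```python
import numpy as np

def unit(v):
    """Normalize each row vector to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def rotation_matrix(axis, angle):
    """Rotation about a given axis by a given angle (Rodrigues' formula in matrix form)."""
    a = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def triple_products(b, R, rl, rr):
    """[b r'_l r_r] for each ray pair, where r'_l = R r_l (rays stored as rows)."""
    rl_rot = rl @ R.T
    return np.cross(rl_rot, rr) @ b

rng = np.random.default_rng(1)
points = rng.uniform(-1.0, 1.0, size=(6, 3)) + np.array([0.0, 0.0, 5.0])
b = unit(np.array([1.0, 0.1, 0.0]))
R_true = rotation_matrix(np.array([0.2, 1.0, 0.1]), 0.3)

rr = unit(points)                 # right rays, right camera frame
rl = unit(points + b) @ R_true    # left rays, left camera frame (R_true^T applied to each row)

t_true = triple_products(b, R_true, rl, rr)
# A rotation that is off by 0.05 rad about the x-axis violates the condition.
R_wrong = rotation_matrix(np.array([1.0, 0.0, 0.0]), 0.05) @ R_true
t_wrong = triple_products(b, R_wrong, rl, rr)
print(np.max(np.abs(t_true)), np.max(np.abs(t_wrong)))
```

With the correct rotation and baseline the triple products vanish up to
rounding error; with the perturbed rotation they do not.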

We obtain one such constraint from each pair of rays.
There will be an infinite number of solutions for the baseline
and the rotation when there are fewer than five pairs
of rays, since there are five unknowns and each pair of rays
yields only one constraint. Conversely, if there are more
than five pairs of rays, the constraints are likely to be inconsistent
as the result of small errors in the measurements.
In this case, no exact solution of the set of constraint equations
exists, and it makes sense instead to minimize the sum
of squares of errors in the constraint equations. In practice,
one should use more than five pairs of rays in order to reduce
the influence of measurement errors [Jochmann 1965,
Ghosh 1966]. We shall see later that the added information
also allows one to eliminate spurious solutions.

Figure 2. Two rays approach closest where they are intersected
by a line perpendicular to both. If there is no measurement
error, and the relative orientation has been recovered correctly,
then the two rays actually intersect. In this case the two rays
and the baseline lie in a common plane.


4.  What  is  the  Appropriate  Error  Term? 

The triple product $[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]$ is zero when the left and right
ray are coplanar with the baseline. It is not immediately apparent,
however, that the triple product itself is necessarily
the ideal measure of departure from best fit. It is worthwhile
exploring the geometry of the two rays more carefully.
Consider the points on the rays where they approach each
other the closest (see Fig. 2). The line connecting these


4The field of view is larger than a hemi-sphere in some fish-eye
lenses, where there is significant radial distortion.

5The baseline vector b is here also measured in the coordinate
system of the right camera.
points will be perpendicular to both rays, and hence parallel
to $(\mathbf{r}'_l \times \mathbf{r}_r)$. As a consequence, we can write

$$\alpha\,\mathbf{r}'_l + \gamma\,(\mathbf{r}'_l \times \mathbf{r}_r) = \mathbf{b} + \beta\,\mathbf{r}_r,$$

where $\alpha$ and $\beta$ are proportional to the distances along the
left and the right ray to the points where they approach
most closely, while $\gamma$ is proportional to the shortest distance
between the rays. We can find $\gamma$ by taking the dot-product
of the equality above with $\mathbf{r}'_l \times \mathbf{r}_r$. We obtain

$$\gamma\,\|\mathbf{r}'_l \times \mathbf{r}_r\|^2 = [\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r].$$

Similarly, taking the dot-products with $\mathbf{r}_r \times (\mathbf{r}'_l \times \mathbf{r}_r)$ and
$\mathbf{r}'_l \times (\mathbf{r}'_l \times \mathbf{r}_r)$, we obtain

$$\alpha\,\|\mathbf{r}'_l \times \mathbf{r}_r\|^2 = (\mathbf{b} \times \mathbf{r}_r) \cdot (\mathbf{r}'_l \times \mathbf{r}_r),$$

$$\beta\,\|\mathbf{r}'_l \times \mathbf{r}_r\|^2 = (\mathbf{b} \times \mathbf{r}'_l) \cdot (\mathbf{r}'_l \times \mathbf{r}_r).$$

The magnitudes of $\alpha$ and $\beta$ are the distances along the rays
to the point of closest approach when $\mathbf{r}_r$ and $\mathbf{r}'_l$ are unit
vectors. It turns out, however, that we are more interested
here in the signs than in the magnitudes of $\alpha$ and $\beta$.

Normally, the points where the two rays approach the
closest will be in front of both cameras, that is, both $\alpha$
and $\beta$ will be positive. If the estimated baseline or rotation
is in error, however, then it is possible for one or both of
the calculated parameters $\alpha$ and $\beta$ to be negative. We will
use this observation later to distinguish amongst different
apparent solutions.

We have shown that the perpendicular distance between
the left and the right ray is proportional to the ratio of the triple
product $[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]$ to the magnitude squared of $(\mathbf{r}'_l \times \mathbf{r}_r)$. But
the measurement errors are in the image, not in the scene.
Thus a least-squares procedure should be based on an error
in determining the direction of the rays, not on the distance
of closest approach. To arrive at such a measure, we can
look at the angle, $\theta$ say, between the projections of the left
and right ray into a plane perpendicular to $\mathbf{b}$ (see Fig. 3).
This angle will be zero when the vertical disparity is zero,
that is, when the left and right rays are coplanar with the
baseline.

The projections of $\mathbf{r}'_l$ and $\mathbf{r}_r$ into a plane perpendicular
to $\mathbf{b}$ are given by

$$\bar{\mathbf{r}}'_l = \mathbf{r}'_l - (\mathbf{r}'_l \cdot \mathbf{b})\,\mathbf{b} = (\mathbf{b} \times \mathbf{r}'_l) \times \mathbf{b},$$

$$\bar{\mathbf{r}}_r = \mathbf{r}_r - (\mathbf{r}_r \cdot \mathbf{b})\,\mathbf{b} = (\mathbf{b} \times \mathbf{r}_r) \times \mathbf{b},$$

where we have used the fact that $\mathbf{b}$ is a unit vector. It is
easy to show that

$$\|\bar{\mathbf{r}}'_l\| = \|\mathbf{b} \times \mathbf{r}'_l\| \quad \text{and} \quad \|\bar{\mathbf{r}}_r\| = \|\mathbf{b} \times \mathbf{r}_r\|,$$

keeping in mind again that $\mathbf{b}$ is a unit vector, and that $\mathbf{b} \times \mathbf{r}'_l$
and $\mathbf{b} \times \mathbf{r}_r$ are perpendicular to $\mathbf{b}$. The cross-product of
the two projected vectors $\bar{\mathbf{r}}'_l$ and $\bar{\mathbf{r}}_r$ will be parallel to the
baseline and have magnitude proportional to the sine of the
angle between the projected vectors, and so,

$$\sin\theta = \frac{[\mathbf{b}\ \bar{\mathbf{r}}'_l\ \bar{\mathbf{r}}_r]}{\|\bar{\mathbf{r}}'_l\|\,\|\bar{\mathbf{r}}_r\|} = \frac{[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]}{\|\mathbf{b} \times \mathbf{r}'_l\|\,\|\mathbf{b} \times \mathbf{r}_r\|},$$

where we have used the fact that $[\mathbf{b}\ \bar{\mathbf{r}}'_l\ \bar{\mathbf{r}}_r] = [\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]$,
something that can easily be verified.

Figure 3. One measure of the departure from the coplanarity
condition is the angle $\theta$ between the plane formed by the left
ray and the baseline and the plane formed by the right ray and
the baseline. The angle may be found by projecting the two rays
onto a plane perpendicular to the baseline.

We could use the sine of the angle between the projected
vectors directly as a measure of departure from best
fit. This is not as good an idea as it may appear at first
sight, because of what happens when one of the rays becomes
nearly parallel to the baseline. In this case the angle
will vary rapidly with small changes in the direction of the
ray. Correspondingly, one of the terms in the denominator
in the expression for the sine of the angle becomes small.
It is better to normalize the expression by multiplying by
the lengths of the projected vectors. Then we obtain the
area of the parallelogram formed by the projections of the
rays into a plane perpendicular to the baseline, namely,

$$\|\bar{\mathbf{r}}'_l\|\,\|\bar{\mathbf{r}}_r\|\,\sin\theta = [\mathbf{b}\ \bar{\mathbf{r}}'_l\ \bar{\mathbf{r}}_r] = [\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r].$$

This discussion confirms that the triple product itself is a
good measure of the departure from best fit. This is convenient,
since it makes the least squares analysis reasonably
straightforward. If one desires to use a different error
measure, one can weight the terms in the sums to follow
accordingly.
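
The closest-approach parameters can be computed directly from the three
dot-product formulas for α, β and γ. The following is a minimal sketch
with hypothetical names and made-up sample vectors; it assumes, as
before, that the left projection center lies at -b in the right camera
frame, so that an exact left ray to a point P is proportional to P + b.

```python
import numpy as np

def closest_approach(b, rl_rot, rr):
    """Return (alpha, beta, gamma) for one pair of rays.

    b      : baseline, right camera frame
    rl_rot : left ray rotated into the right camera frame (r'_l)
    rr     : right ray
    """
    c = np.cross(rl_rot, rr)
    n2 = c @ c                                  # |r'_l x r_r|^2
    alpha = np.cross(b, rr) @ c / n2
    beta = np.cross(b, rl_rot) @ c / n2
    gamma = (b @ c) / n2                        # triple product over n2
    return alpha, beta, gamma

P = np.array([0.3, -0.2, 4.0])                  # scene point, right camera frame
b = np.array([1.0, 0.1, 0.0])
rl_rot = (P + b) / np.linalg.norm(P + b)
rr = P / np.linalg.norm(P)

alpha, beta, gamma = closest_approach(b, rl_rot, rr)

# A slightly perturbed left ray no longer intersects the right ray,
# but the defining vector equation still holds.
rl_pert = rl_rot + np.array([0.0, 0.01, 0.0])
a2, b2, g2 = closest_approach(b, rl_pert, rr)
lhs = a2 * rl_pert + g2 * np.cross(rl_pert, rr)
rhs = b + b2 * rr

# Reversing the baseline reverses the signs of all three parameters.
flipped = closest_approach(-b, rl_rot, rr)
print(alpha, beta, gamma)
```

For the exact rays, α and β equal the distances from the left and right
projection centers to the point, and γ is zero up to rounding error.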
5.  Least  Squares  Solution  for  the  Baseline

If  the  rotation  is  known,  it  is  easy  to  find  the  best  fit  base¬ 
line,  as  we  show  next.  This  is  useful,  despite  the  fact  that 
we  do  not  usually  know  the  rotation.  The  reason  is  that  the 
ability  to  find  the  best  baseline,  given  a  rotation,  reduces 
the  dimensionality  of  the  search  space  from  five  to  three. 

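Let $\{\mathbf{r}_{l,i}\}$ and $\{\mathbf{r}_{r,i}\}$, for $i = 1 \ldots n$, be corresponding
sets of left and right rays. We wish to minimize

$$E = \sum_{i=1}^{n} [\mathbf{b}\ \mathbf{r}'_{l,i}\ \mathbf{r}_{r,i}]^2 = \sum_{i=1}^{n} \left(\mathbf{b} \cdot (\mathbf{r}'_{l,i} \times \mathbf{r}_{r,i})\right)^2,$$

subject to the condition $\|\mathbf{b}\|^2 = 1$, where $\mathbf{r}'_{l,i}$ is the rotated
left ray $\mathbf{r}_{l,i}$, as before. If we let $\mathbf{c}_i = \mathbf{r}'_{l,i} \times \mathbf{r}_{r,i}$, we can
rewrite the sum in the simpler form

$$E = \sum_{i=1}^{n} (\mathbf{b} \cdot \mathbf{c}_i)^2 = \mathbf{b}^T \left( \sum_{i=1}^{n} \mathbf{c}_i \mathbf{c}_i^T \right) \mathbf{b},$$

where we have used the equivalence $\mathbf{b} \cdot \mathbf{c}_i = \mathbf{b}^T \mathbf{c}_i$, which
depends on the interpretation of column vectors as 3 x 1
matrices. The term $\mathbf{c}_i \mathbf{c}_i^T$ is a dyadic product, a 3 x 3 matrix
obtained by multiplying a 3 x 1 matrix by a 1 x 3 matrix.

The error sum is a quadratic form involving the real
symmetric matrix

$$C = \sum_{i=1}^{n} \mathbf{c}_i \mathbf{c}_i^T.$$

The minimum of such a quadratic form is the smallest
eigenvalue of the matrix $C$, attained when $\mathbf{b}$ is the corresponding
unit eigenvector (see, for example, the discussion
of Rayleigh's quotient in [Korn & Korn 1968]). This can
be verified by introducing a Lagrangian multiplier $\lambda$ and
minimizing

$$E' = \mathbf{b}^T C\,\mathbf{b} + \lambda\,(1 - \mathbf{b}^T \mathbf{b}),$$

subject to the condition $\mathbf{b}^T \mathbf{b} = 1$. Differentiating with
respect to $\mathbf{b}$ and setting the result equal to zero yields:

$$C\,\mathbf{b} = \lambda\,\mathbf{b}.$$

The error corresponding to a particular solution of this
equation is found by premultiplying by $\mathbf{b}^T$:

$$E = \mathbf{b}^T C\,\mathbf{b} = \lambda\,\mathbf{b}^T \mathbf{b} = \lambda.$$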

The three eigenvalues of the real symmetric matrix $C$ are
non-negative, and can be found in closed form by solving
a cubic equation, while each of the corresponding eigenvectors
has components that are the solution of three homogeneous
equations in three unknowns [Korn & Korn 1968].
If the data are relatively free of measurement error, then
the smallest eigenvalue will be much smaller than the other
two, and a reasonable approximation to the sought-after
result can be obtained by solving for the eigenvector using
the assumption that the smallest eigenvalue is actually
zero. This way one need not even solve the cubic equation
(see also [Horn & Weldon 1988]).
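
The eigenvector solution for the baseline can be sketched in a few lines.
This is an illustration rather than the closed-form cubic solution
mentioned in the text: it uses a general-purpose symmetric eigensolver,
and the function name and the synthetic scene are assumptions. The left
rays are taken to be already rotated into the right camera frame, with
the left projection center at -b in that frame.

```python
import numpy as np

def best_baseline(rl_rot, rr):
    """Least-squares baseline direction for a known rotation.

    rl_rot, rr : (n, 3) arrays of rotated left rays and right rays.
    Returns the unit eigenvector for the smallest eigenvalue of C,
    together with all eigenvalues in ascending order.
    """
    c = np.cross(rl_rot, rr)          # c_i = r'_l,i x r_r,i, one per row
    C = c.T @ c                       # sum of the dyadic products c_i c_i^T
    eigvals, eigvecs = np.linalg.eigh(C)
    return eigvecs[:, 0], eigvals

def unit(v):
    """Normalize each row vector to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(2)
points = rng.uniform(-1.0, 1.0, size=(8, 3)) + np.array([0.0, 0.0, 5.0])
b_true = unit(np.array([1.0, 0.1, 0.0]))
rl_rot = unit(points + b_true)
rr = unit(points)

b_est, eigvals = best_baseline(rl_rot, rr)
print("smallest eigenvalue:", eigvals[0])
print("|b_est . b_true|   :", abs(b_est @ b_true))
```

The recovered direction agrees with the true baseline only up to sign,
which is the ambiguity taken up next in the text.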


If $\mathbf{b}$ is a unit eigenvector, so is $-\mathbf{b}$. Changing the
sense of the baseline does not change the magnitude of the
error term $[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]$. It does, however, change the signs of
$\alpha$, $\beta$ and $\gamma$. One can decide which sense of the baseline
direction is appropriate by determining the signs of $\alpha_i$ and
$\beta_i$ for $i = 1 \ldots n$. Ideally, they should all be positive. The
solution for the optimal baseline is not unique unless there
are at least two pairs of corresponding rays. The reason
is that the eigenvector we are looking for is not uniquely
determined if more than one of the eigenvalues is zero, and
the matrix has rank less than two if it is the sum of fewer
than two dyadic products of independent vectors. This is
not a significant restriction, however, since we need at least
five pairs of rays to solve for the rotation anyway.
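
Choosing the sense of the baseline from the signs of the α and β
parameters might look like the sketch below. The majority vote used to
tolerate a few noisy pairs is an assumption of this example, not
something the text prescribes, and all names and sample data are
hypothetical.

```python
import numpy as np

def closest_approach_params(b, rl_rot, rr):
    """alpha_i and beta_i for every ray pair (rays stored as rows)."""
    c = np.cross(rl_rot, rr)
    n2 = np.sum(c * c, axis=1)
    alpha = np.sum(np.cross(b, rr) * c, axis=1) / n2
    beta = np.sum(np.cross(b, rl_rot) * c, axis=1) / n2
    return alpha, beta

def choose_sense(b, rl_rot, rr):
    """Return b or -b, whichever puts more closest-approach points in front of the cameras."""
    alpha, beta = closest_approach_params(b, rl_rot, rr)
    positives = np.sum(alpha > 0) + np.sum(beta > 0)
    # There are 2n signs in total; keep b if at least half of them are positive.
    return b if positives >= alpha.size else -b

def unit(v):
    """Normalize each row vector to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(3)
points = rng.uniform(-1.0, 1.0, size=(8, 3)) + np.array([0.0, 0.0, 5.0])
b_true = unit(np.array([1.0, 0.1, 0.0]))
rl_rot = unit(points + b_true)
rr = unit(points)

from_flipped = choose_sense(-b_true, rl_rot, rr)
from_correct = choose_sense(b_true, rl_rot, rr)
print(from_flipped, from_correct)
```

Either starting sense leads to the same final direction, because
reversing the baseline reverses every sign at once.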


6.  Iterative  Improvement  of  Relative  Orientation. 

If  one  ignores  the  orthonormality  of  the  rotation  matrix,  a 
set  of  nine  homogeneous  linear  equations  can  be  obtained 
by  a  transformation  of  the  coplanarity  conditions  that  was 
first described in [Thompson 1959b]. These equations can
be solved when eight pairs of corresponding ray directions
are known [Longuet-Higgins 1981]. This is not a least-
squares  method  that  can  make  use  of  redundant  measure¬ 
ments,  nor  can  it  be  applied  when  fewer  than  eight  points 
are  given.  The  method  is  also  strongly  affected  by  measure¬ 
ment  errors  and  fails  for  certain  configurations  of  points 
[Longuet-Higgins  1984]. 
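
The linear eight-pair approach can be sketched as follows. This is an
illustration under assumptions, not a transcription of the cited
methods: the coplanarity condition is rewritten as r_r^T M r_l = 0 with
the 3 x 3 matrix M = [b]_x R, the orthonormality of R is ignored, and
the nine entries of M are taken from the null space of an 8 x 9 system
using a singular value decomposition. The names `skew`,
`linear_solution` and `M` are hypothetical, as is the synthetic scene.

```python
import numpy as np

def skew(v):
    """Matrix [v]_x such that skew(v) @ w equals np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])

def rotation_matrix(axis, angle):
    """Rotation about a given axis by a given angle."""
    a = axis / np.linalg.norm(axis)
    K = skew(a)
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def linear_solution(rl, rr):
    """Solve r_r^T M r_l = 0 for the nine entries of M, up to scale and sign."""
    # Each ray pair contributes one homogeneous linear equation in the entries of M.
    A = np.array([np.kron(r_right, r_left) for r_left, r_right in zip(rl, rr)])
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)       # right singular vector for the smallest singular value

def unit(v):
    """Normalize each row vector to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(4)
points = rng.uniform([-2.0, -2.0, 3.0], [2.0, 2.0, 8.0], size=(8, 3))
b = unit(np.array([1.0, 0.1, 0.0]))
R = rotation_matrix(np.array([0.2, 1.0, 0.1]), 0.3)
rr = unit(points)
rl = unit(points + b) @ R             # left rays in the left camera frame

M_true = skew(b) @ R
M_est = linear_solution(rl, rr)
cosine = abs(np.sum(M_est * M_true)) / (np.linalg.norm(M_est) * np.linalg.norm(M_true))
print("alignment of estimated and true matrices:", cosine)
```

With exact data the estimate is parallel to the true matrix. The sketch
does not address the sensitivity to measurement error or the degenerate
point configurations noted in the text.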

No  true  closed-form  solution  of  the  least-squares  prob¬ 
lem  has  been  found  for  the  general  case,  where  both  baseline 
and  rotation  are  unknown.  However,  it  is  possible  to  deter¬ 
mine  how  the  overall  error  is  affected  by  small  changes  in 
the  baseline  and  small  changes  in  the  rotation.  This  allows 
one  to  make  iterative  adjustments  to  the  baseline  and  the 
rotation  that  reduce  the  sum  of  squares  of  errors. 

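We can represent a small change in the baseline by
an infinitesimal quantity $\delta\mathbf{b}$. If this change is to leave the
length of the baseline unaltered, then

$$\|\mathbf{b} + \delta\mathbf{b}\|^2 = \|\mathbf{b}\|^2,$$

or

$$\|\mathbf{b}\|^2 + 2\,\mathbf{b} \cdot \delta\mathbf{b} + \|\delta\mathbf{b}\|^2 = \|\mathbf{b}\|^2.$$

If we ignore quantities of second-order, we obtain

$$\delta\mathbf{b} \cdot \mathbf{b} = 0,$$

that is, $\delta\mathbf{b}$ must be perpendicular to $\mathbf{b}$.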

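A small change in the rotation can be represented by
an infinitesimal rotation vector $\delta\boldsymbol{\omega}$. The direction of this
vector is parallel to the axis of rotation, while its magnitude
is the angle of rotation. This incremental rotation takes the
rotated left ray, $\mathbf{r}'_l$, into

$$\mathbf{r}''_l = \mathbf{r}'_l + (\delta\boldsymbol{\omega}\times\mathbf{r}'_l).$$

This follows from Rodrigues' formula for the rotation of a
vector $\mathbf{r}$:

$$\cos\theta\,\mathbf{r} + \sin\theta\,(\boldsymbol{\omega}\times\mathbf{r}) + (1-\cos\theta)(\boldsymbol{\omega}\cdot\mathbf{r})\,\boldsymbol{\omega},$$

if we let $\theta = \|\delta\boldsymbol{\omega}\|$, $\boldsymbol{\omega} = \delta\boldsymbol{\omega}/\|\delta\boldsymbol{\omega}\|$, and note that $\delta\boldsymbol{\omega}$ is an
infinitesimal quantity. Finally then, we see that $[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]$
becomes

$$[(\mathbf{b}+\delta\mathbf{b})\ (\mathbf{r}'_l + \delta\boldsymbol{\omega}\times\mathbf{r}'_l)\ \mathbf{r}_r],$$

or,

$$[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r] + [\delta\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r] + [\mathbf{b}\ (\delta\boldsymbol{\omega}\times\mathbf{r}'_l)\ \mathbf{r}_r],$$

if we ignore the term $[\delta\mathbf{b}\ (\delta\boldsymbol{\omega}\times\mathbf{r}'_l)\ \mathbf{r}_r]$, because it contains
the product of two infinitesimal quantities. We can expand
two of the triple products in the expression above and obtain

$$[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r] + (\mathbf{r}'_l\times\mathbf{r}_r)\cdot\delta\mathbf{b} + (\mathbf{r}_r\times\mathbf{b})\cdot(\delta\boldsymbol{\omega}\times\mathbf{r}'_l),$$

or

$$t + \mathbf{c}\cdot\delta\mathbf{b} + \mathbf{d}\cdot\delta\boldsymbol{\omega},$$

for short, where

$$t = [\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r],\quad \mathbf{c} = \mathbf{r}'_l\times\mathbf{r}_r,\quad\text{and}\quad \mathbf{d} = \mathbf{r}'_l\times(\mathbf{r}_r\times\mathbf{b}).$$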
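The first-order expansion can be checked numerically. The following is a minimal sketch in plain Python, not part of the original method description; the function names and the sample vectors are illustrative assumptions. It compares the exact perturbed triple product with the linear approximation built from t, c and d.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def triple(a, b, c):
    """Scalar triple product [a b c] = a . (b x c)."""
    return dot(a, cross(b, c))

def linear_terms(b, rl, rr):
    """Return (t, c, d) for one ray pair; rl is the already rotated left ray."""
    t = triple(b, rl, rr)
    c = cross(rl, rr)
    d = cross(rl, cross(rr, b))
    return t, c, d

def perturbed_triple(b, rl, rr, db, dw):
    """Exact value of [(b + db) (rl + dw x rl) rr]."""
    b2 = tuple(x + y for x, y in zip(b, db))
    rl2 = tuple(x + y for x, y in zip(rl, cross(dw, rl)))
    return triple(b2, rl2, rr)

# Hypothetical sample values, chosen only for illustration.
b = (1.0, 0.0, 0.0)
rl = (0.2, 0.3, 0.9)
rr = (-0.1, 0.25, 0.95)
db = (0.0, 1e-4, -2e-4)      # perpendicular to b
dw = (3e-4, -1e-4, 2e-4)

t, c, d = linear_terms(b, rl, rr)
approx = t + dot(c, db) + dot(d, dw)
exact = perturbed_triple(b, rl, rr, db, dw)
neglected = triple(db, cross(dw, rl), rr)   # the second-order term that was dropped
```

Here `exact - approx` should equal `neglected` up to rounding, which is the product of two small quantities.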
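Now, we are trying to minimize

$$E = \sum_{i=1}^{n} (t_i + \mathbf{c}_i\cdot\delta\mathbf{b} + \mathbf{d}_i\cdot\delta\boldsymbol{\omega})^2,$$

subject to the condition $\mathbf{b}\cdot\delta\mathbf{b} = 0$. This constrained
minimization problem can be transformed into an equivalent
unconstrained form by introduction of a Lagrangian
multiplier. Instead of minimizing $E$ itself, we then have to
minimize:

$$E' = E + \lambda(\mathbf{b}\cdot\delta\mathbf{b}).$$

Differentiating $E'$ with respect to $\delta\mathbf{b}$, setting the result
equal to zero, and absorbing a common factor of two into $\lambda$ yields

$$\sum_{i=1}^{n} (t_i + \mathbf{c}_i\cdot\delta\mathbf{b} + \mathbf{d}_i\cdot\delta\boldsymbol{\omega})\,\mathbf{c}_i + \lambda\mathbf{b} = 0.$$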

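Before we can proceed, we need to eliminate the unknown
Lagrangian multiplier $\lambda$. The dot-product of the expression
with $\mathbf{b}$ leads to

$$\sum_{i=1}^{n} (t_i + \mathbf{c}_i\cdot\delta\mathbf{b} + \mathbf{d}_i\cdot\delta\boldsymbol{\omega})\,(\mathbf{c}_i\cdot\mathbf{b}) + \lambda(\mathbf{b}\cdot\mathbf{b}) = 0,$$

which, since $\mathbf{b}\cdot\mathbf{b} = 1$, gives us a value for $\lambda$ that we can
use to compute the term

$$\lambda\mathbf{b} = -\mathbf{b}\sum_{i=1}^{n} (t_i + \mathbf{c}_i\cdot\delta\mathbf{b} + \mathbf{d}_i\cdot\delta\boldsymbol{\omega})\,(\mathbf{c}_i\cdot\mathbf{b}),$$

or

$$\lambda\mathbf{b} = -(\mathbf{b}\mathbf{b}^T)\sum_{i=1}^{n} (t_i + \mathbf{c}_i\cdot\delta\mathbf{b} + \mathbf{d}_i\cdot\delta\boldsymbol{\omega})\,\mathbf{c}_i.$$

Finally, substituting for $\lambda\mathbf{b}$ in the equation above we obtain

$$B\sum_{i=1}^{n} (t_i + \mathbf{c}_i\cdot\delta\mathbf{b} + \mathbf{d}_i\cdot\delta\boldsymbol{\omega})\,\mathbf{c}_i = 0,$$

where

$$B = (I - \mathbf{b}\mathbf{b}^T)$$

is the projection operator that removes components of vectors
parallel to the baseline, and $I$ is just the $3\times 3$ identity
matrix. We can conclude that the equation above relates
quantities in a plane perpendicular to the baseline.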

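Finally, if we differentiate $E'$ with respect to $\delta\boldsymbol{\omega}$ and
set this result also equal to zero, we obtain

$$\sum_{i=1}^{n} (t_i + \mathbf{c}_i\cdot\delta\mathbf{b} + \mathbf{d}_i\cdot\delta\boldsymbol{\omega})\,\mathbf{d}_i = 0.$$

Together, the two vector equations constitute six linear
scalar equations in the six unknown components of $\delta\mathbf{b}$ and
$\delta\boldsymbol{\omega}$. We can rewrite them in the more compact form:

$$BC\,\delta\mathbf{b} + BF\,\delta\boldsymbol{\omega} = -B\mathbf{c}$$

$$F^T\,\delta\mathbf{b} + D\,\delta\boldsymbol{\omega} = -\mathbf{d}$$

or

$$\begin{pmatrix} BC & BF \\ F^T & D \end{pmatrix}\begin{pmatrix} \delta\mathbf{b} \\ \delta\boldsymbol{\omega} \end{pmatrix} = -\begin{pmatrix} B\mathbf{c} \\ \mathbf{d} \end{pmatrix}$$

where

$$C = \sum_{i=1}^{n} \mathbf{c}_i\mathbf{c}_i^T,\quad F = \sum_{i=1}^{n} \mathbf{c}_i\mathbf{d}_i^T,\quad\text{and}\quad D = \sum_{i=1}^{n} \mathbf{d}_i\mathbf{d}_i^T,$$

while

$$\mathbf{c} = \sum_{i=1}^{n} t_i\mathbf{c}_i\quad\text{and}\quad \mathbf{d} = \sum_{i=1}^{n} t_i\mathbf{d}_i.$$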

The matrix $B$ is singular, as can be seen by noting that
$\mathbf{b}$ is an eigenvector with zero eigenvalue. Thus the first
three equations above are not independent. One of them
will have to be removed. Fortunately, we also still have to
incorporate the condition that $\delta\mathbf{b}$ be perpendicular to $\mathbf{b}$.
We can do this by replacing one of the first three equations
with the linear equation $\mathbf{b}\cdot\delta\mathbf{b} = 0$. For the best numerical
accuracy one should eliminate the equation with the
smallest coefficients.
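One way the six-by-six system might be assembled and solved is sketched below in plain Python. This is an illustration under stated assumptions rather than the author's implementation: the function names are hypothetical, "smallest coefficients" is interpreted here as the row with the smallest sum of squared entries, and a simple Gaussian elimination stands in for whatever linear solver one prefers. The synthetic data at the end use an exact rotation and a slightly wrong baseline.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def solve(A, y):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(y)
    M = [list(row) + [y[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for j in range(k, n + 1):
                M[r][j] -= f * M[k][j]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(M[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x

def error_sum(b, pairs):
    """Sum of squared triple products [b r'_l r_r]."""
    return sum(dot(b, cross(rl, rr)) ** 2 for rl, rr in pairs)

def increments(b, pairs):
    """One linearized step; returns (delta_b, delta_omega).

    b is the unit baseline estimate; each pair is (r'_l, r_r), with the
    left ray already rotated by the current rotation estimate.
    """
    C = [[0.0] * 3 for _ in range(3)]
    F = [[0.0] * 3 for _ in range(3)]
    D = [[0.0] * 3 for _ in range(3)]
    cv, dv = [0.0] * 3, [0.0] * 3
    for rl, rr in pairs:
        c = cross(rl, rr)
        d = cross(rl, cross(rr, b))
        t = dot(b, c)
        for i in range(3):
            cv[i] += t * c[i]
            dv[i] += t * d[i]
            for j in range(3):
                C[i][j] += c[i] * c[j]
                F[i][j] += c[i] * d[j]
                D[i][j] += d[i] * d[j]
    # Projection B = I - b b^T, applied to the first three equations.
    B = [[(1.0 if i == j else 0.0) - b[i] * b[j] for j in range(3)]
         for i in range(3)]
    def project(M):
        return [[dot(B[i], [M[k][j] for k in range(3)]) for j in range(3)]
                for i in range(3)]
    BC, BF = project(C), project(F)
    A = [BC[i] + BF[i] for i in range(3)]
    A += [[F[j][i] for j in range(3)] + D[i] for i in range(3)]
    y = [-dot(B[i], cv) for i in range(3)] + [-v for v in dv]
    # Replace the top-block equation with the smallest coefficients by b . db = 0.
    k = min(range(3), key=lambda i: sum(v * v for v in A[i]))
    A[k] = list(b) + [0.0, 0.0, 0.0]
    y[k] = 0.0
    x = solve(A, y)
    return tuple(x[:3]), tuple(x[3:])

# Synthetic check (hypothetical scene points in the right camera system).
b_true = (1.0, 0.0, 0.0)
points = [(-2.0, 1.0, 6.0), (1.5, -1.0, 5.0), (0.5, 2.0, 8.0), (-1.0, -2.0, 7.0),
          (3.0, 0.5, 9.0), (-3.0, 1.5, 5.5), (2.0, 2.5, 6.5), (0.0, -1.5, 10.0)]
pairs = [(unit(tuple(p + q for p, q in zip(b_true, R))), unit(R)) for R in points]
b0 = unit((1.0, 0.05, -0.03))
db, dw = increments(b0, pairs)
b1 = unit(tuple(p + q for p, q in zip(b0, db)))
```

Because the triple product is linear in the baseline, a single step with the exact rotation should bring the normalized baseline `b1` back to `b_true` for this noise-free data; with noisy data or an inexact rotation several iterations would be needed.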

The  above  gives  us  a  way  of  finding  small  changes  in 
the  baseline  and  rotation  that  reduce  the  overall  error  sum. 
This  method  can  be  applied  iteratively  to  locate  a  min¬ 
imum.  Numerical  experiments  confirm  that  it  converges 
rapidly  when  a  good  initial  guess  is  available.  Incremental 
adjustments  cannot  be  computed  if  the  six-by-six  coefficient 
matrix  becomes  singular.  This  will  happen  when  there  are 
fewer  than  five  pairs  of  rays,  and  for  certain  rare  configu¬ 
rations  of  points  in  the  scene  (see  the  discussion  of  critical 
surfaces  later  on). 

7.  Adjusting  the  Baseline  and  the  Rotation 

The iterative adjustment of the baseline is straightforward:

$$\mathbf{b}^{n+1} = \mathbf{b}^{n} + \delta\mathbf{b}^{n},$$

where $\mathbf{b}^{n}$ is the baseline estimate at the beginning of the
$n$-th iteration, while $\delta\mathbf{b}^{n}$ is the adjustment computed during
the $n$-th iteration, as discussed in the previous section. If
$\delta\mathbf{b}^{n}$ is not infinitesimal, the result will not be a unit vector.
We can, and should, normalize the result by dividing by its
magnitude.
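The update and renormalization amount to only a few lines. A minimal sketch follows, with a hypothetical function name and a deliberately large increment so that the effect of the normalization is visible.

```python
import math

def update_baseline(b, db):
    """Add the increment to the baseline and rescale the sum to unit length."""
    s = [x + y for x, y in zip(b, db)]
    n = math.sqrt(sum(v * v for v in s))
    return tuple(v / n for v in s)

# Illustrative values only: the increment is perpendicular to b but not small.
b_next = update_baseline((1.0, 0.0, 0.0), (0.0, 0.3, -0.4))
```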


9.  Remaining  Ambiguity 


Next, note that the triple product, $[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]$, changes
sign, but not magnitude, when we replace $\mathbf{b}$ with $-\mathbf{b}$. Thus
the two possible senses of the baseline yield the same sum
of squares of errors. However, changing the sign of $\mathbf{b}$ does
change the signs of both $\alpha$ and $\beta$, the distances along the
left and right rays. All scene points imaged
are in front of the camera, so the distances should all be
positive. In the presence of noise, it is possible that some of
the distances turn out to be negative, but with reasonable
data almost all of them should be positive. This allows us
to easily pick the correct sense for the baseline.

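Not so obvious is another possibility: Suppose we turn
all of the left measurements through $\pi$ radians about the
baseline, in addition to the rotation already determined.
That is, replace $\mathbf{r}'_l$ with

$$\mathbf{r}''_l = 2(\mathbf{b}\cdot\mathbf{r}'_l)\,\mathbf{b} - \mathbf{r}'_l.$$

This follows from Rodrigues' formula for the rotation of a
vector $\mathbf{r}$:

$$\cos\theta\,\mathbf{r} + \sin\theta\,(\boldsymbol{\omega}\times\mathbf{r}) + (1-\cos\theta)(\boldsymbol{\omega}\cdot\mathbf{r})\,\boldsymbol{\omega},$$

with $\theta = \pi$ and $\boldsymbol{\omega} = \mathbf{b}$. Then the triple product $[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]$
turns into

$$2(\mathbf{b}\cdot\mathbf{r}'_l)[\mathbf{b}\ \mathbf{b}\ \mathbf{r}_r] - [\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r] = -[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r].$$

This, once again, reverses the sign of the error term, but
not its magnitude. Thus the sum of squares of errors is
unaltered. The signs of $\alpha$ and $\beta$ are affected, however,
although this time not in as simple a way as when the sense
of the baseline was reversed.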
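The sign reversal is easy to confirm numerically. The sketch below, in plain Python with illustrative vectors that are not taken from the paper, applies Rodrigues' formula with an angle of pi about a unit baseline, compares it with the closed form, and evaluates the triple product before and after.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def triple(a, b, c):
    """Scalar triple product [a b c] = a . (b x c)."""
    return dot(a, cross(b, c))

def rodrigues(r, w, theta):
    """Rotate r about the unit axis w through the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    wxr = cross(w, r)
    wr = dot(w, r)
    return tuple(c * r[i] + s * wxr[i] + (1.0 - c) * wr * w[i] for i in range(3))

def half_turn(r, b):
    """Closed form for a rotation through pi about the unit vector b."""
    k = 2.0 * dot(b, r)
    return tuple(k * b[i] - r[i] for i in range(3))

# Hypothetical sample values; b has unit length.
b = (0.6, 0.0, 0.8)
rl = (0.1, -0.4, 0.9)
rr = (-0.2, 0.3, 0.93)

rl_turned = half_turn(rl, b)
before = triple(b, rl, rr)
after = triple(b, rl_turned, rr)
```

The two triple products should differ only in sign, so the squared error contributed by this ray pair is the same for both solutions.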

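If $[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r] = 0$, we find that exactly one of $\alpha$ and
$\beta$ changes sign. This can be shown as follows: The triple
product will be zero when the left and right rays are coplanar
with the baseline. In this case we have $\gamma = 0$, and
so

$$\alpha\,\mathbf{r}'_l = \mathbf{b} + \beta\,\mathbf{r}_r.$$

Taking the cross-product with $\mathbf{b}$ we obtain

$$\alpha\,(\mathbf{r}'_l\times\mathbf{b}) = \beta\,(\mathbf{r}_r\times\mathbf{b}).$$

If we now replace $\mathbf{r}'_l$ by $\mathbf{r}''_l = 2(\mathbf{b}\cdot\mathbf{r}'_l)\,\mathbf{b} - \mathbf{r}'_l$, we have for the
new distances $\alpha'$ and $\beta'$ along the rays:

$$-\alpha'\,(\mathbf{r}'_l\times\mathbf{b}) = \beta'\,(\mathbf{r}_r\times\mathbf{b}).$$

We conclude that the product $\alpha'\beta'$ has sign opposite to that
of the product $\alpha\beta$. So if $\alpha$ and $\beta$ are both positive, one of
$\alpha'$ or $\beta'$ must be negative.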

In the presence of measurement error the triple product
will not be exactly equal to zero. If the rays are nearly
coplanar with the baseline, however, we find that one of $\alpha$
and $\beta$ almost always changes sign. With very poor data,
it is possible that both change sign. (Even with totally
random ray directions, however, this only happens 27.3%
of the time, as determined by Monte Carlo methods.) In
any case, we can reject a solution in which roughly half the
distances are negative. Moreover, we can find the correct
solution directly by introducing an additional rotation of $\pi$
radians about the baseline.


If  we  take  care  of  the  three  apparent  two-way  ambiguities 
discussed  in  the  previous  section,  we  find  that  in  practice  a 
unique  solution  is  found,  provided  that  a  sufficiently  large 
number of ray pairs are available. That is, the method converges
to the unique global minimum from every possible
starting point in parameter space.

Local  minima  in  the  sum  of  squares  of  errors  appear 
when  only  a  few  more  than  the  minimum  of  five  ray  pairs 
are  available  (as  is  common  in  practice).  This  means  that 
one  has  to  repeat  the  iteration  with  different  starting  val¬ 
ues  for  the  rotation  in  order  to  locate  the  global  minimum. 
A starting value for the baseline can be found in each case
using the closed-form method described in section 5. To
search  the  parameter  space  effectively,  one  needs  a  way  of 
efficiently  sampling  the  space  of  rotations.  The  space  of  ro¬ 
tations  is  isomorphic  to  the  unit  sphere  in  four  dimensions, 
with  antipodal  points  identified.  The  rotation  groups  of  the 
regular  polyhedra  provide  convenient  means  of  uniformly 
sampling  the  space  of  rotations.  The  group  of  rotations 
of  the  tetrahedron  has  12  elements,  that  of  the  hexahedron 
and  the  octahedron  has  24,  and  that  of  the  icosahedron  and 
the  dodecahedron  has  60  (representations  of  these  groups 
are  given  in  Appendix  A  for  convenience).  One  can  use 
these  as  starting  values  for  the  rotation.  Alternatively,  one 
can  just  generate  a  number  of  randomly  placed  points  on 
the  unit  sphere  in  four  dimensions  as  starting  values  for  the 
rotation. 
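The random alternative can be sketched as follows. This assumes the usual unit-quaternion parameterization of rotations and draws uniformly distributed points on the unit sphere in four dimensions by normalizing four independent Gaussian samples; the function names and the seed are illustrative choices, not something prescribed by the text.

```python
import math
import random

def random_unit_quaternion(rng):
    """Uniform point on the unit sphere in four dimensions."""
    while True:
        q = [rng.gauss(0.0, 1.0) for _ in range(4)]
        n = math.sqrt(sum(x * x for x in q))
        if n > 1e-12:                      # guard against a degenerate draw
            return tuple(x / n for x in q)

def quaternion_to_matrix(q):
    """Rotation matrix for the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return (
        (1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)),
        (2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)),
        (2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)),
    )

def starting_rotations(count, seed=0):
    """Generate `count` random rotation matrices as starting values."""
    rng = random.Random(seed)
    return [quaternion_to_matrix(random_unit_quaternion(rng)) for _ in range(count)]
```

Since q and -q map to the same matrix, the identification of antipodal points is handled automatically.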

When  there  are  only  five  pairs  of  rays,  the  situation  is 
different  again.  In  this  case,  we  have  five  non-linear  equa¬ 
tions  in  five  unknowns  and  so  in  general  expect  to  find  a 
finite  number  of  exact  solutions.  That  is,  it  is  possible  to 
find  baselines  and  rotations  that  satisfy  the  coplanarity  con¬ 
ditions  exactly  and  reduce  the  sum  of  squares  of  errors  to 
zero. If we ignore the three types of ambiguities discussed
above (including the rotation by $\pi$ radians about the baseline),
then there are generally four distinct sets of baselines
and  rotations  that  yield  an  exact  solution.  Typically  only 
one  of  these  yields  positive  signs  for  all  of  the  distances. 
These  are  empirical  observations;  I  have  not  been  able  to 
prove  that  there  are  generally  four  solutions  that  satisfy  the 
coplanarity  conditions. 

10.  Summary  of  the  Algorithm 

Consider  first  the  case  where  we  have  an  initial  guess  for  the 
rotation.  We  start  by  finding  the  best-fit  baseline  direction 
using  the  closed-form  method  described  in  section  5.  We 
determine  the  correct  sense  of  the  baseline  by  choosing  the 
one  that  makes  most  of  the  signs  of  the  distances  positive. 
Then we proceed as follows:

•  For each pair of corresponding image points, we
compute $\mathbf{r}'_{l,i}$, the left ray direction $\mathbf{r}_{l,i}$ rotated
into the right camera coordinate system, using the
present guess for the rotation.

•  We then compute the cross-product $\mathbf{c}_i = \mathbf{r}'_{l,i}\times\mathbf{r}_{r,i}$,
the double cross-product $\mathbf{d}_i = \mathbf{r}'_{l,i}\times(\mathbf{r}_{r,i}\times\mathbf{b})$ and
the triple-product $t_i = [\mathbf{b}\ \mathbf{r}'_{l,i}\ \mathbf{r}_{r,i}]$.

•  We accumulate the dyadic products $\mathbf{c}_i\mathbf{c}_i^T$, $\mathbf{c}_i\mathbf{d}_i^T$
and $\mathbf{d}_i\mathbf{d}_i^T$, as well as the vectors $t_i\mathbf{c}_i$ and $t_i\mathbf{d}_i$.
The totals of these quantities over all image point
pairs give us the matrices $C$, $F$, $D$ and the vectors
$\mathbf{c}$ and $\mathbf{d}$.

•  We can now solve for the increment in the baseline
$\delta\mathbf{b}$ and the increment in the rotation $\delta\boldsymbol{\omega}$ using the
method derived in section 6.

•  We adjust the baseline and the rotation using the
methods discussed in section 7, and recompute the
sum of the squares of the error terms.

The  new  orientation  parameters  are  then  used  in  the  next 
iteration  of  the  above  sequence  of  steps.  As  is  the  case 
with  most  iterative  procedures,  it  is  hard  to  decide  when  to 
stop.  Typically,  the  total  error  becomes  small  after  a  few 
iterations  and  no  longer  decreases  at  each  step,  because 
of  limited  accuracy  in  the  arithmetic  operations.  One  can 
stop the iteration the first time the error increases. Alternatively,
one can stop after either a fixed number of iterations
or after the error becomes less than some predetermined
threshold.
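The stopping rules just listed can be combined in a small driver loop. The sketch below is generic and makes several assumptions: `step` and `error` are hypothetical callables standing in for one adjustment of the orientation parameters and for the sum of squared errors, and the default limits are arbitrary.

```python
def iterate(step, state, error, max_iterations=50, threshold=1e-12):
    """Repeat `step` until the error stops decreasing, falls below
    `threshold`, or `max_iterations` steps have been taken."""
    best_error = error(state)
    for _ in range(max_iterations):
        if best_error < threshold:
            break
        candidate = step(state)
        candidate_error = error(candidate)
        if candidate_error >= best_error:   # first increase (or stall): stop
            break
        state, best_error = candidate, candidate_error
    return state, best_error

# Toy stand-ins: each step halves the parameter, the error is its square.
final_state, final_error = iterate(lambda x: x / 2.0, 1.0, lambda x: x * x)
```

Keeping the last state that lowered the error means that a step which increases the error is discarded rather than returned.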

When  the  decision  has  been  made  to  stop  the  itera¬ 
tion,  a  check  of  the  signs  of  the  distances  along  the  rays 
is  in  order.  If  most  of  them  are  negative,  the  baseline  di¬ 
rection  should  be  reversed.  If  neither  sense  of  the  baseline 
direction yields mostly positive distances, one needs to consider
the possibility of a rotation through $\pi$ radians about
the  baseline  b.  If  this  also  yields  mixed  signs,  one  is  dealing 
with  a  local  extremum  of  the  error  function;  something  that 
will  only  happen  if  the  initial  guess  is  in  fact  not  adequate. 

If  an  initial  guess  is  not  available,  one  proceeds  as  fol¬ 
lows: 

•  For  each  rotation  in  the  chosen  group  of  rotations, 
perform  the  above  iteration  to  obtain  a  candidate 
baseline  and  rotation. 

•  Choose  the  solution  that  yields  the  smallest  total 
error. 

When  there  are  many  pairs  of  rays,  the  iterative  algorithm 
will  converge  to  the  global  minimum  error  solution  from  any 
initial  guess  for  the  rotation.  There  is  no  need  to  sample  the 
space  of  rotations  in  this  case.  Also,  instead  of  sampling  the 


space  of  rotations  in  a  systematic  way  using  a  finite  group 
of  rotations,  one  can  generate  points  randomly  distributed 
on  the  surface  of  the  unit  sphere  in  four-dimensional  space. 
This  provides  a  simpler  means  of  generating  initial  guesses, 
although  more  initial  guesses  may  have  to  be  tried  than 
when  a  systematic  procedure  is  used  to  sample  the  space  of 
rotations. 

The method given minimizes the sum of the squares of
the triple products

$$[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r].$$

If desired, one can modify it to use some multiple of the
triple product as an error term by weighting the contribution
to the overall sum. This can lead to a considerably
more complex optimization problem if the weights depend
on the unknown baseline and the unknown rotation. This
happens, for example, if we try to minimize the sum of
squares of the sines of the angles corresponding to vertical
disparity:

$$\frac{[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]}{\|\mathbf{b}\times\mathbf{r}'_l\|\,\|\mathbf{b}\times\mathbf{r}_r\|}.$$

If we assume that the weighting factors vary slowly during
the iterative process, however, we can use the current
estimates of the baseline and rotation in computing
the  weighting  factors,  and  not  take  into  account  the  small 
variations  in  the  error  sum  due  to  changes  in  the  weighting 
factors.  That  is,  when  taking  derivatives,  the  weighting  fac¬ 
tors  are  considered  constant.  This  is  a  good  approximation 
when  the  parameters  vary  slowly,  as  they  will  when  one  is 
close  to  an  extremum. 

11.  Critical  Surfaces 

In  certain  rare  cases,  relative  orientation  cannot  be  recov¬ 
ered  fully,  even  when  there  are  five  or  more  pairs  of  rays. 
Normally,  each  error  term  varies  linearly  with  distance  in 
parameter  space  from  the  location  of  an  extremum,  and  so 
the  sum  of  squares  of  errors  varies  quadratically.  There  are 
situations, however, where the error terms do not vary linearly
with distance, but quadratically or to higher order, in
certain  special  directions  in  parameter  space.  In  this  case, 
the  sum  of  squares  of  errors  does  not  vary  quadratically 
with  distance  from  the  extremum,  but  as  a  function  of  the 
fourth  or  higher  power  of  this  distance.  This  makes  it  very 
difficult  to  accurately  locate  the  extremum.  In  an  extreme 
situation,  the  total  error  may  not  vary  at  all  along  some 
curve  in  parameter  space.  In  this  case,  the  total  error  is 
unaffected  by  a  small  change  in  the  rotation,  as  long  as  this 
change  is  accompanied  by  a  corresponding  small  change  in 
the  baseline.  There  is  no  localized  extremum  and  conse¬ 
quently  the  solution  for  relative  orientation  is  not  unique. 
It turns out that this problem arises only when the observed
scene points lie on certain surfaces called Gefährliche
Flächen or critical surfaces [Brandenberger 1947, Hofmann
1949, Zeller 1952, Schwidefsky 1973]. We show next that
only  points  on  certain  hyperboloids  of  one  sheet  and  their 
degenerate  forms  can  lead  to  this  kind  of  problem  (see  also 
[Horn  1987b]). 

We could try to find a direction of movement in parameter
space $(\delta\mathbf{b}, \delta\boldsymbol{\omega})$ that leaves the total error unaffected
when given a particular surface. Instead, we will
take  the  critical  direction  of  motion  in  the  parameter  space 
as  given,  and  try  to  find  a  surface  for  which  the  total  error 
is  unchanged. 

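Let $\mathbf{R}$ be a point on the surface, measured in the right
camera coordinate system. Then

$$\beta\,\mathbf{r}_r = \mathbf{R}\quad\text{and}\quad \alpha\,\mathbf{r}'_l = \mathbf{b} + \mathbf{R},$$

for some positive $\alpha$ and $\beta$. In the absence of measurement
errors,

$$[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r] = \frac{1}{\alpha\beta}\,[\mathbf{b}\ (\mathbf{b}+\mathbf{R})\ \mathbf{R}] = 0.$$

We noted earlier that when we change the baseline and the
rotation slightly, the triple product $[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r]$ becomes

$$[(\mathbf{b}+\delta\mathbf{b})\ (\mathbf{r}'_l + \delta\boldsymbol{\omega}\times\mathbf{r}'_l)\ \mathbf{r}_r],$$

or, if we ignore higher-order terms,

$$[\mathbf{b}\ \mathbf{r}'_l\ \mathbf{r}_r] + (\mathbf{r}'_l\times\mathbf{r}_r)\cdot\delta\mathbf{b} + (\mathbf{r}_r\times\mathbf{b})\cdot(\delta\boldsymbol{\omega}\times\mathbf{r}'_l).$$

The problem we are focusing on here arises when this error
term is unchanged for small movement in some direction in
the parameter space. That is when

$$(\mathbf{r}'_l\times\mathbf{r}_r)\cdot\delta\mathbf{b} + (\mathbf{r}_r\times\mathbf{b})\cdot(\delta\boldsymbol{\omega}\times\mathbf{r}'_l) = 0,$$

for some $\delta\mathbf{b}$ and $\delta\boldsymbol{\omega}$. Introducing the coordinates of the
imaged points we obtain:

$$\frac{1}{\alpha\beta}\Big(\big((\mathbf{b}+\mathbf{R})\times\mathbf{R}\big)\cdot\delta\mathbf{b} + (\mathbf{R}\times\mathbf{b})\cdot\big(\delta\boldsymbol{\omega}\times(\mathbf{b}+\mathbf{R})\big)\Big) = 0,$$

or

$$(\mathbf{R}\times\mathbf{b})\cdot(\delta\boldsymbol{\omega}\times\mathbf{R}) + (\mathbf{R}\times\mathbf{b})\cdot(\delta\boldsymbol{\omega}\times\mathbf{b}) + [\mathbf{b}\ \mathbf{R}\ \delta\mathbf{b}] = 0.$$

If we expand the first of the dot-products of the cross-products,
we can write this equation in the form

$$(\mathbf{R}\cdot\mathbf{b})(\delta\boldsymbol{\omega}\cdot\mathbf{R}) - (\mathbf{b}\cdot\delta\boldsymbol{\omega})(\mathbf{R}\cdot\mathbf{R}) + \mathbf{L}\cdot\mathbf{R} = 0,$$

where

$$\mathbf{L} = \boldsymbol{\ell}\times\mathbf{b},\quad\text{while}\quad \boldsymbol{\ell} = \mathbf{b}\times\delta\boldsymbol{\omega} + \delta\mathbf{b}.$$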

The expression on the left-hand side contains a part that
is quadratic in $\mathbf{R}$ and a part that is linear. The expression
is clearly quadratic in $X$, $Y$, and $Z$, the components of the
vector $\mathbf{R} = (X, Y, Z)^T$. Thus a surface leading to the kind
of problem described above must be a quadric surface [Korn &
Korn 1968].

Note  that  there  is  no  constant  term  in  the  equation  of 
the  surface,  so  R  =  0  satisfies  the  equation.  This  means 
that  the  surface  passes  through  the  right  projection  center. 
It is easy to verify that $\mathbf{R} = -\mathbf{b}$ satisfies the equation also,
which means that the surface passes through the left projection
center as well. In fact, the whole baseline (and its
extensions), $\mathbf{R} = k\mathbf{b}$, lies in the surface. This means that
we  must  be  dealing  with  a  ruled  quadric  surface.  It  can  con¬ 
sequently  not  be  an  ellipsoid  or  hyperboloid  of  two  sheets, 
or  one  of  their  degenerate  forms.  The  surface  must  be  a 
hyperboloid  of  one  sheet,  or  one  of  its  degenerate  forms. 
Additional  information  about  the  properties  of  these  sur¬ 
faces  is  given  in  Appendix  B,  while  the  degenerate  forms 
are  explored  in  Appendix  C. 
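The properties claimed for the critical surface can be spot-checked numerically. The following sketch, in plain Python with hypothetical function names and arbitrary sample values, evaluates the quadric form of the equation, compares it with the unexpanded first-order change in the error term (without the positive factor 1/(alpha beta)), and confirms that points on the baseline satisfy it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def scale(k, v):
    return tuple(k * a for a in v)

def quadric_form(R, b, db, dw):
    """(R.b)(dw.R) - (b.dw)(R.R) + L.R with L = l x b, l = b x dw + db."""
    l = add(cross(b, dw), db)
    L = cross(l, b)
    return dot(R, b) * dot(dw, R) - dot(b, dw) * dot(R, R) + dot(L, R)

def error_change(R, b, db, dw):
    """((b + R) x R).db + (R x b).(dw x (b + R)), the unexpanded form."""
    bR = add(b, R)
    return dot(cross(bR, R), db) + dot(cross(R, b), cross(dw, bR))

# Illustrative values: b is a unit vector and db is perpendicular to it.
b = (0.6, 0.0, 0.8)
db = (0.8e-3, 2.0e-3, -0.6e-3)
dw = (1.0e-3, -2.0e-3, 0.5e-3)
R = (1.5, -2.0, 7.0)

on_surface = [quadric_form(scale(k, b), b, db, dw) for k in (0.0, -1.0, 2.5)]
```

For a general point such as `R` the two functions should agree, while every entry of `on_surface` should vanish up to rounding.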

It should be apparent that this kind of ambiguity is
quite rare. It is nevertheless an issue of practical importance,
since the accuracy of the solution is reduced
if the points lie near some critical surface. A textbook case
of  this  occurs  in  aerial  photography  of  a  roughly  U-shaped 
valley  taken  along  a  flight  line  parallel  to  the  axis  of  the 
valley  from  a  height  above  the  valley  floor  approximately 
equal  to  the  width  of  the  valley.  In  this  case  the  baseline 
lies  on  a  circular  cylinder  that  also  lies  close  to  the  surface 
on  which  the  imaged  points  lie.  This  means  that  it  is  close 
to  one  of  the  degenerate  forms  of  the  hyperboloid  of  one 
sheet  (see  Appendix  C). 


Note  that  hyperboloids  of  one  sheet  and  their  degen¬ 
erate  forms  are  exactly  the  surfaces  that  lead  to  ambigu¬ 
ity  in  the  case  of  motion  vision.  The  coordinate  systems 
and  symbols  have  been  chosen  here  to  make  the  correspon¬ 
dence  between  the  two  problems  more  obvious.  The  re¬ 
lationship  between  the  two  situations  is  nevertheless  not 
quite  as  transparent  as  I  had  thought  earlier  [Horn  1987b]. 
In the case of the ambiguity of the motion field, for example,
we are dealing with a two-way ambiguity arising from
infinitesimal  displacements.  Here  we  are  dealing  with  an 
infinite  number  of  solutions  arising  from  images  taken  with 
cameras that have large differences in position and orientation.
Also note that the symbol $\delta\boldsymbol{\omega}$ stands for a small
change  in  a  finite  rotation  here,  while  it  refers  to  a  dif¬ 
ference  in  instantaneous  rotational  velocities  in  the  motion 
vision  case. 

12.  Conclusions 

Methods  for  recovering  the  relative  orientation  of  two  cam¬ 
eras  are  of  importance  in  both  binocular  stereo  and  mo¬ 
tion  vision.  A  new  iterative  method  for  finding  the  relative 
orientation  has  been  described  here.  It  can  be  used  even 
when  there  is  no  initial  guess  available  for  the  rotation  or 
the  baseline.  The  new  method  does  not  use  Euler  angles  to 
represent  the  orientation. 

When  there  are  many  pairs  of  corresponding  image 
points,  the  iterative  method  finds  the  global  minimum  from 
any  starting  point  in  parameter  space.  Local  minima  in  the 
sum of squares of errors occur, however, when there are relatively
few pairs of corresponding image points available.
Methods for efficiently locating the global minimum in this
case  were  discussed.  When  only  five  pairs  of  correspond¬ 
ing  image  points  are  given,  several  exact  solutions  of  the 
coplanarity  equations  can  be  found.  Typically  only  one  of 
these  yields  positive  distances  to  the  points  in  the  scene, 
however.  This  allows  one  to  pick  the  correct  solution  even 
when  there  is  no  initial  guess  available. 

All  of  these  methods  fail  in  the  rare  case  that  the  scene 
points  lie  on  a  critical  surface. 
Abstract

In  this  paper,  we  present  an  approach  to  color  im¬ 
age  understanding  that  can  be  used  to  segment  and 
analyze surfaces with color variations due to highlights
and shading. We begin with a theory that relates
the reflected light from dielectric materials, such
as  plastic,  to  fundamental  physical  reflection 
processes,  and  describes  the  color  of  the  reflected 
light  as  a  linear  combination  of  the  color  of  the  light 
due  to  surface  reflection  (highlights)  and  body  reflec¬ 
tion  (object  color).  This  theory  is  used  in  an  algorithm 
that  separates  a  color  image  into  two  parts:  an  image 
of  just  the  highlights,  and  the  original  image  with  the 
highlights  removed.  In  the  past,  we  have  applied  this 
method  to  hand-segmented  images.  The  current 
paper shows how to perform automatic segmentation
by applying this theory in stages to identify
the  object  and  highlight  colors.  The  result  is  a  com¬ 
bination  of  segmentation  and  reflection  analysis  that  is 
better  than  traditional  heuristic  segmentation  methods 
(such  as  histogram  thresholding),  and  provides  impor¬ 
tant  physical  information  about  the  surface  geometry 
and  material  properties  at  the  same  time.  We  also 
show  the  importance  of  modeling  the  camera 
properties  for  this  kind  of  quantitative  analysis  of 
color.  This  line  of  research  can  lead  to  physics-based 
image segmentation methods that are both more reliable
and more useful than traditional segmentation
methods.

1.  Introduction 

One  of  the  key  goals  of  computer  vision  is  to  inter¬ 
pret  the  scene  as  a  collection  of  shiny  and  matte  sur¬ 
faces,  smooth  and  rough,  interacting  with  light,  shape, 
and  shadow.  However,  computer  vision  has  not  yet 
been  successful  at  deriving  such  a  description  of  sur¬ 
face  and  illumination  properties  from  an  image.  The 
key  reason  for  this  failure  has  been  a  lack  of  models 
or  descriptions  rich  enough  to  relate  pixels  and  pixel- 
aggregates  to  scene  characteristics.  In  the  past,  most 
work with color images has considered object color to
be a constant property of an object, and color variation
on an object was attributed to noise1,2,3.
However,  color  variation  in  real  scenes  depends  to  a 
much  larger  degree  on  the  optical  reflection  properties 
of  the  scene.  This  variation  causes  the  perception  of 
object color, highlights, shadows and shading4,5,
scene  properties  that  can  be  determined  and  used  by 
color  vision  algorithms. 

This  paper  presents  an  approach  to  color  image 
understanding  that  accounts  for  color  variations  due 
to  highlights  and  shading.  We  use  a  reflection  model 
which  describes  pixel  colors  as  a  linear  combination 
of  an  object  color  and  a  highlight  color6.  All  color 
pixels  from  one  object  form  a  planar  cluster  in  the 
color  space.  The  cluster  shape  is  determined  by  the 
object  and  highlight  colors  and  by  the  object  shape 
and  illumination  geometry.  We  extend  our  reflection 
model  with  a  sensor  model  which  accounts  for 
camera  properties,  such  as  a  limited  dynamic  range, 
blooming,  gamma-correction,  and  chromatic  aber¬ 
ration.  This  allows  us  to  apply  our  algorithms  to  real 
images,  instead  of  only  to  synthesized  images. 
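The planar-cluster property follows directly from the linear combination. The sketch below illustrates it in plain Python under simplifying assumptions: the two colors and the scale factors are hypothetical, and the camera effects listed above (limited dynamic range, blooming, gamma-correction, chromatic aberration) are ignored.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dichromatic_pixel(body_color, surface_color, m_body, m_surface):
    """Pixel color as m_body * body_color + m_surface * surface_color."""
    return tuple(m_body * cb + m_surface * cs
                 for cb, cs in zip(body_color, surface_color))

def distance_from_plane(pixel, body_color, surface_color):
    """Distance of a pixel from the plane spanned by the two colors."""
    normal = cross(body_color, surface_color)
    return abs(dot(pixel, normal)) / math.sqrt(dot(normal, normal))

body = (0.8, 0.2, 0.1)        # hypothetical object color (R, G, B)
highlight = (1.0, 1.0, 1.0)   # hypothetical highlight (illumination) color

matte = dichromatic_pixel(body, highlight, 0.6, 0.0)    # shading only
glossy = dichromatic_pixel(body, highlight, 0.5, 0.3)   # shading plus highlight
off_model = (0.1, 0.9, 0.2)   # a color that does not fit this object
```

Pixels generated by the model lie in the plane through the origin spanned by the two colors, whereas a color from a different material generally does not; this is the kind of test a hypothesis about one object's cluster could rely on.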

We  have  previously  reported  how  this  theory  can  be 
used  to  separate  color  images  into  two  intrinsic 
images,  one  showing  the  scene  without  highlights, 
and  the  other  showing  only  the  highlights7.  In  the 
past,  we  have  applied  this  method  to  hand-segmented 
images.  The  current  paper  describes  an  automatic 
segmentation  method  that  is  based  on  the  extended 
dichromatic  reflection  model.  Our  approach  alter¬ 
nates  between  generating  hypotheses  about  the 
scene  from  the  image  data  and  verifying  whether  the 
hypotheses  fit  the  image.  The  hypotheses  relate  ob¬ 
ject  color,  shading,  highlights  and  camera  limitations 
to  the  shapes  of  color  clusters  in  local  image  areas. 
By  using  this  control  structure,  driven  by  a  physical 
model  of  light  reflection,  we  are  able  to  incrementally 
identify  local  and  global  properties  of  the  scene,  such 
as  object  and  illumination  colors,  and  to  use  them  in 
interpreting  pixels  in  the  images.  This  allows  us  to 
adapt  the  image  interpretation  process  to  local  scene 
characteristics  and  to  react  differently  to  color  and  in¬ 
tensity  changes  at  different  places  in  the  image.  This 
method  stands  in  contrast  to  Gershon's  approach8,  in 
which  he  begins  with  a  traditional  segmentation  and 
follows  it  with  a  physics-based  post-processing  step. 
His approach suffers from erroneous region boundaries
created by the initial segmentation, while our
method uses the physical model from the beginning
and therefore does not make such unrecoverable mistakes.

The  result  is  a  combination  of  segmentation  and 
reflection  analysis  that  is  better  than  traditional  heuris¬ 
tic  segmentation  methods  that  base  their  analysis  on 
intensity  or  color  differences  or  on  a  fixed  set  of  user- 
defined  features,  such  as  intensity,  hue  and  satura¬ 
tion.  Traditional  algorithms  also  cannot  distinguish 
reliably  between  different  edge  types,  such  as  high¬ 
light  edges,  material  edges  and  shading  or  shadow 
edges,  and  they  cannot  account  for  camera  limita¬ 
tions.  Moreover,  our  method  generates  important 
physical  information  about  the  scene.  This  infor¬ 
mation  significantly  simplifies  our  subsequent  step  to 
separate  color  images  into  two  intrinsic  reflection 
images7,  since  the  segmentation  already  provides  all 
necessary  information  about  color  variation  on  an  ob¬ 
ject.  When  combined  with  methods  to  interpret  the 
intrinsic images9,4, this line of research can lead to
physics-based  image  segmentation  methods  that  are 
both  more  reliable  and  more  useful  than  traditional 
segmentation  methods. 

2.  The  Dichromatic  Reflection  Model 

On  its  path  from  a  light  source  to  the  camera,  a  light 
ray  is  altered  in  many  characteristic  ways  by  the  ob¬ 
jects  in  the  scene.  The  camera  then  encodes  the 
measured  light  in  a  (color)  pixel.  It  is  the  goal  of 
image  understanding  methods  to  use  properties  and 
relationships  between  pixels  to  recover  a  description 
of  the  scene.  It  is  essential  to  the  success  of  such 
methods  that  they  understand  and  model  the  reflec¬ 
tion  processes  in  the  scene,  as  well  as  the  sensing 
characteristics  of  the  camera.  How  light  is  reflected 
from  an  object  depends  on  the  material  of  the  object. 
It  is  common  to  distinguish  between  conducting 
materials,  such  as  metals,  and  dielectric  (non¬ 
conducting)  materials,  such  as  plastics,  paints,  papers 
and  ceramics.  In  the  following,  we  will  present  a 
model  that  describes  the  reflection  processes  of 
dielectric  materials. 

Most  dielectric  materials  are  inhomogeneous.  They 
consist  of  a  medium  and  some  embedded  pigments, 
as  shown  in  Figure  2-1.  The  pigments  selectively 
absorb  light  and  scatter  it  by  reflection  and  refraction. 
When  we  look  at  objects  that  are  made  out  of 
dielectric  material,  we  usually  see  the  reflected  light 
as  composed  of  two  distinct  colors  that  typify  the  high¬ 
light  areas  and  the  matte  object  parts.  This  is  a 
characteristic  property  of  many  dielectrics,  and  our 
reflection  model  capitalizes  on  this  characteristic  color 
change  between  matte  and  highlight  areas. 

When  light  hits  the  surface  of  a  dielectric  material, 
the  change  in  refraction  indices  causes  the  light  to 
partially  reflect  back  into  the  air  and  to  partially  refract, 
penetrating into the material body. We refer to the


reflection  process  at  the  material  surface  as  surface 
reflection.  It  generally  appears  as  a  highlight  or  as 
gloss  on  an  object.  Fresnel's  law  describes  how  the 
reflected  light  depends  on  the  refraction  indices  of  the 
material  and  the  surrounding  medium,  on  the  in¬ 
cidence  angle  and  on  the  polarization  of  the  light10. 
When  modelling  surface  reflection,  however,  it  is  com¬ 
mon  to  make  some  simplifying  assumptions.  Assum¬ 
ing  that  little  or  no  light  absorption  occurs  at  the 
material  surface,  we  may  conclude  that  the  light 
reflected  from  the  surface  has  the  same  color  as  the 
illumination.  This  is  a  characteristic  feature  of  high¬ 
lights  on  most  dielectric  materials.  Depending  on  the 
smoothness  of  the  surface,  the  light  may  be  reflected 
in  a  preferred  direction  (mirror  reflection)  or  it  may  be 
scattered  into  many  directions.  Several  models  have 
been  developed  in  the  physics  and  computer  graphics 
communities  to  describe  the  geometric  properties  of 
light reflection from rough surfaces [11, 12, 13].


Figure  2-1 :  Reflection  from  a  dielectric  material 

For  dielectric  materials,  not  all  incident  light  is 
reflected  at  the  material  surface.  Some  percentage  of 
the  light  penetrates  into  the  material  body.  The 
refracted  light  beam  travels  through  the  medium,  hit¬ 
ting  pigments  from  time  to  time.  The  pigments  scatter 
the  light  and  partially  or  entirely  absorb  it  at  some 
wavelengths [14, 15]. Some of the light finally exits from
the material body back into the air. We refer to this
reflection process as body reflection. Its geometric
and  photometric  properties  depend  on  many  factors: 
the  transmitting  properties  of  the  medium,  the  scat¬ 
tering  and  absorption  properties  of  the  pigments,  and 
the  shape  and  distribution  (including  density  and 
orientation) of the pigments. If we assume a random
distribution of the pigments, the light exits in random
directions  from  the  body.  In  the  extreme,  when  the 
exiting  light  is  uniformly  distributed,  the  distribution 
can  be  described  by  Lambert's  law.  The  distribution 
of  the  pigments  also  influences  the  amount  and  the 
color  of  the  reflected  light.  If  the  pigments  are  dis¬ 
tributed  randomly  in  the  material  body,  we  may  expect 
that,  on  the  average,  the  same  amount  and  color  will 
be  absorbed  everywhere  in  the  material  before  the 
light  exits.  In  such  a  case,  the  light  that  is  reflected 
from  the  material  body  has  the  same  color  over  the 
entire  surface. 

According to the above discussion, our reflection
model describes the light, $L(\lambda,i,e,g)$,² which is
reflected from an object point as a mixture of the light
$L_s(\lambda,i,e,g)$ reflected at the material surface and the
light $L_b(\lambda,i,e,g)$ reflected from the material body.

$$L(\lambda,i,e,g) = L_s(\lambda,i,e,g) + L_b(\lambda,i,e,g) \qquad (1)$$

If we assume that there is only a single light source
in the scene, with no inter-reflection between objects,
and that highlights (surface reflection) have the same
color as the illumination, we can then separate the
spectral reflection properties of $L_s$ from its geometric
reflection properties. We thus model it as a product of
a spectral power distribution, $c_s(\lambda)$, and a geometric
scale factor, $m_s(i,e,g)$, which describes the intensity of
the reflected light. Similarly, we separate the body
reflection component $L_b$ into a spectral power
distribution, $c_b(\lambda)$, and a geometric scale factor, $m_b(i,e,g)$.
Substituting these terms into equation (1), we obtain
the Dichromatic Reflection Model equation:

$$L(\lambda,i,e,g) = m_s(i,e,g)\,c_s(\lambda) + m_b(i,e,g)\,c_b(\lambda) \qquad (2)$$

We thus describe the light that is reflected from an
object point as a mixture of two distinct spectral power
distributions, $c_s(\lambda)$ and $c_b(\lambda)$, each of which is scaled
according to the geometric reflection properties of
surface and body reflection. In the infinite-dimensional
vector space of spectral power distributions (each
wavelength defines an independent dimension in this
vector space [16, 17]), the reflected light can thus be
described as a linear combination of the two vectors
$c_s(\lambda)$ and $c_b(\lambda)$.
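
As an illustration of equation (2), the following minimal sketch evaluates the model on spectra sampled at a few discrete wavelengths. The function name, the five-sample spectra, and the scale factors are assumptions chosen for illustration; they are not measured data.

```python
def dichromatic_light(m_s, m_b, c_s, c_b):
    """Return m_s * c_s + m_b * c_b per wavelength sample (equation 2)."""
    if len(c_s) != len(c_b):
        raise ValueError("spectra must share the same wavelength samples")
    return [m_s * s + m_b * b for s, b in zip(c_s, c_b)]


# Hypothetical spectra sampled at five wavelengths (illustrative values only).
c_s = [1.0, 1.0, 1.0, 1.0, 1.0]  # surface reflection: color of a white illuminant
c_b = [0.1, 0.2, 0.6, 0.9, 0.8]  # body reflection: a reddish pigment

# A matte point has negligible surface reflection; a highlight point adds
# surface reflection on top of the same body reflection.
matte_point = dichromatic_light(0.0, 0.7, c_s, c_b)
highlight_point = dichromatic_light(0.5, 0.7, c_s, c_b)
```

In this sketch the highlight point differs from the matte point by a multiple of the surface reflection vector only, which is the linear-combination property stated above.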

Many reflection models that have been
developed in the physics and computer graphics
communities [18, 4, 12, 13] are special cases of the model
described here [6]. They replace the geometric
variables, $m_s$ and $m_b$, by specific functions that
approximate the measured reflection data of a chosen
set of typical materials. In our work, we concentrate
on the spectral variables in equation (2), $c_s(\lambda)$ and
$c_b(\lambda)$, exploiting the color difference between them.
We leave the geometrical factors unspecified.

² $i$, $e$, and $g$ describe the angles of the incident and emitted light and
the phase angle; $\lambda$ is the wavelength parameter.

3.  Object  Shape  and  Color  Variation 

We  will  now  discuss  the  relationship  between  the 
light  mixtures  of  all  points  on  an  object.  We  study  the 
spectral  variation  over  an  entire  object  by  analyzing 
the  histogram  of  the  light  mixtures  from  all  object 
points. 


Figure  3-1 :  The  shape  of  the  spectral  cluster  for  a 
cylindrical  object 

An  investigation  of  the  geometrical  properties  of 
surface  and  body  reflection  reveals  that  the  light  mix¬ 
tures  form  a  dense  spectral  cluster  in  the  dichromatic 
plane.  The  shape  of  this  cluster  is  closely  related  to 
the  shape  of  the  object,  as  we  will  now  describe.  For 
illustration  purposes,  we  will  assume  in  the  following 
discussion  of  spectral  histograms  that  body  reflection 
is  approximately  Lambertian  and  that  surface  reflec¬ 
tion  is  describable  by  a  function  with  a  sharp  peak 
around  the  angle  of  perfect  mirror  reflection.  Note, 
however,  that  this  analysis  is  not  limited  to  a  particular 
geometric  reflection  model.  Figure  3-1  shows  a 
sketch of a shiny cylinder. The left part of the figure
displays the magnitudes of the body and surface
reflection components. The curves show the loci of
constant body or surface reflection. The darker
curves are the loci of constant surface reflection.
Since $m_s(i,e,g)$ decreases sharply around the object
point with maximal surface reflection, $m_{sMax}$, these
curves are shown only in a small object area. We call
the points in this area highlight points. The remaining
object points are matte points. The right part of the
figure shows the corresponding spectral histogram in
the dichromatic plane. As we will describe below, the
object points form two linear clusters in the histogram.

For matte points, the surface reflection component
of the reflected light is negligible and all the reflected
light comes from body reflection. The observed light
at such points thus depends only on $c_b(\lambda)$, scaled by
$m_b(i,e,g)$ according to the geometrical relationship
between the local surface normal of the object and the
viewing and illumination directions. Consequently, the
matte points form a matte line in the dichromatic plane
in the direction of the body reflection vector, $c_b(\lambda)$, as
shown in the right part of Figure 3-1.

Highlight points exhibit both body reflection and
surface reflection. However, since $m_s(i,e,g)$ is much
more sensitive to a small change in the photometric
angles than $m_b(i,e,g)$, the body reflection component
is generally approximately constant in a highlight
area, as displayed by the curve with label $m_{bH}$ in
Figure 3-1. Accordingly, the second term of the
Dichromatic Reflection Model equation (2) has a
constant value, $m_{bH}\,c_b(\lambda)$, and all spectral variation within
the highlight comes from varying amounts of $m_s(i,e,g)$.
The highlight points thus form a highlight line in the
dichromatic plane in the direction of the surface
reflection vector, $c_s(\lambda)$. The line departs from the matte line
at position $m_{bH}\,c_b(\lambda)$, as shown in Figure 3-1. More
precisely, the highlight cluster looks like a slim,
skewed wedge because of the small variation of the
body reflection component over the highlight.

The combined spectral cluster of matte and
highlight points thus looks like a skewed T. The skewing
angle of the T depends on the spectral difference
between the body and surface reflection vectors, while
the position and width of the highlight line depend on
the illumination geometry. If the phase angle $g$
between the illumination and viewing direction at a
highlight is very small, the incidence direction of the light is
close to the surface normal. The underlying amount
of body reflection thus is very high. The highlight line
then starts near the tip of the matte line, and the
skewed T becomes a skewed L. The wider the phase
angle $g$ is, the smaller is the amount of underlying
body reflection. We have investigated this
relationship more precisely for the case of a spherical object
which is illuminated by a point light source at some
distance $d$ and viewed by a camera from another
point at the same distance from the object. If the
angle between the illumination and the viewing
direction becomes too wide, the object point with (globally)
maximal body reflection becomes invisible to the
camera. For wider phase angles, the (local)
maximum in body reflection, $m_{bLocalMax}$, decreases as the
phase angle increases. We have calculated the
relationship between $m_{bLocalMax}$ and the amount of
body reflection under the highlight, $m_{bH}$, for varying
illumination geometries. Under reasonable imaging
conditions, $m_{bH}$ has a value that is more than 48% of
$m_{bLocalMax}$. The highlight line thus generally starts in
the upper 50% of the matte line. A more detailed
analysis will be presented in [19]. We exploit this
property as the 50%-heuristic in our segmentation
method.

4.  A  Camera  Model 

The  previous  sections  have  described  light  reflec¬ 
tion  in  a  theoretical,  physical  model.  However,  the 
observed pixel values are also influenced by the
characteristics  of  the  recording  camera.  Since  some 
of  these  influences  disturb  the  light  reflection 
properties  stated  in  the  Dichromatic  Reflection  Model, 
we  need  to  provide  methods  that  restore  the  physical 
properties  of  light  reflection.  Where  this  is  impossible, 
our  image  analysis  algorithms  must  be  able  to  detect 
or  tolerate  the  inaccuracies  in  the  image  data.  This 
section briefly describes how some camera
characteristics influence the pixel values in real images. A
more detailed analysis, including color pictures with
color clusters from real images, can be found in [7].

The  Dichromatic  Reflection  Model  describes  light 
reflection  using  the  continuous  spectrum  of  light. 
When  a  sensing  device  such  as  a  camera  or  the 
human  eye  records  an  image,  the  light  is  integrated 
over  the  spectrum  using  a  small  number  of  weighting 
functions such as color filters. This process of
spectral integration sums the amount of incoming
light, $L(\lambda,i,e,g)$, weighted by the spectral transmittance
of the filter, $\tau_f(\lambda)$, and the spectral responsivity of the
camera, $s(\lambda)$, over all wavelengths:

$$C_f = \int L(\lambda,i,e,g)\,\tau_f(\lambda)\,s(\lambda)\,d\lambda \qquad (3)$$

We use a red, a green and a blue color filter
(Wratten filters number 25, 58 and 47A), thus
reducing the infinite-dimensional vector space to a
three-dimensional space. The spectrum of an incoming
light beam at pixel position $(x,y)$ is represented by a
triple $C(x,y) = [R,G,B]$, where $i$, $e$, and $g$ are
determined by $x$ and $y$ and by the position of the object
relative to the camera.

Spectral integration is a linear transformation [20, 17].
For this reason, the linear relationship between
reflected light and the colors of surface and body
reflection, as stated in equation (2), is maintained
under spectral integration. We thus obtain the
Dichromatic Reflection Model for the
three-dimensional color space:

$$C(x,y) = m_s(i,e,g)\,C_s + m_b(i,e,g)\,C_b \qquad (4)$$

The color pixel value $C(x,y)$ is thus a linear
combination of the two color vectors, $C_s = [R_s,G_s,B_s]$ and
$C_b = [R_b,G_b,B_b]$, which describe the colors of surface
and body reflection on an object in the scene. Within
the three-dimensional color space, $C_s$ and $C_b$ span a
dichromatic plane which contains a parallelogram in
which the color pixel values lie.
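
The step from equation (2) to equation (4) can be checked numerically. The sketch below approximates the integral in equation (3) by a Riemann sum over sampled wavelengths and confirms that integrating the mixed spectrum gives the same triple as mixing the integrated color vectors. All filter curves, the responsivity curve, and the sample spacing are made-up illustrative values, not the characteristics of the Wratten filters or of our camera.

```python
def integrate_band(light, transmittance, responsivity, d_lambda):
    """Riemann-sum approximation of equation (3) for one color filter."""
    return sum(l * t * s
               for l, t, s in zip(light, transmittance, responsivity)) * d_lambda


def to_color(light, filters, responsivity, d_lambda):
    """Map a sampled spectrum to a color triple [R, G, B]."""
    return [integrate_band(light, f, responsivity, d_lambda) for f in filters]


# Hypothetical data on five wavelength samples, 75 nm apart.
d_lambda = 75.0
responsivity = [0.4, 0.7, 1.0, 0.9, 0.6]
filters = [
    [0.0, 0.0, 0.1, 0.8, 0.9],  # "red" filter
    [0.0, 0.3, 0.9, 0.2, 0.0],  # "green" filter
    [0.9, 0.7, 0.1, 0.0, 0.0],  # "blue" filter
]
c_s = [1.0, 1.0, 1.0, 1.0, 1.0]
c_b = [0.1, 0.2, 0.6, 0.9, 0.8]
m_s, m_b = 0.3, 0.7

# Left side: integrate the mixed spectrum of equation (2).
mixed = [m_s * s + m_b * b for s, b in zip(c_s, c_b)]
lhs = to_color(mixed, filters, responsivity, d_lambda)

# Right side: mix the integrated color vectors C_s and C_b (equation 4).
C_s = to_color(c_s, filters, responsivity, d_lambda)
C_b = to_color(c_b, filters, responsivity, d_lambda)
rhs = [m_s * cs + m_b * cb for cs, cb in zip(C_s, C_b)]
```

Up to floating-point rounding, `lhs` and `rhs` agree in every band, which is the linearity property used in equation (4).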

We  will  now  briefly  describe  a  few  camera  limita¬ 
tions  that  are  important  factors  when  images  of 
scenes  with  highlights  are  taken. 

•  Real  cameras  have  only  a  limited 
dynamic  range  to  sense  the  brightness  of 
the  incoming  light.  This  restricts  our 
analysis  of  light  reflection  to  a  color  cube, 
as  shown  in  Figure  4-1 .  If  the  incoming 
light  is  too  bright  at  some  pixel  position, 
the  camera  cannot  sense  and  represent  it 
adequately  and  the  light  is  clipped  in  one 
or  more  color  bands.  Color  clipping  can 
be  a  problem  for  measuring  the  color  of 
highlights  or  bright  objects.  In  the  color 
histograms,  it  causes  the  clusters  to  bend 
along  the  walls  and  edges  of  the  color 
cube (see Figure 4-1). Such clipped
color pixels do not follow the characteristics
of the dichromatic reflection model
and must thus be distinguished from
matte pixels and highlight pixels.

• If a CCD-camera is used to obtain the
images, too much incident light at a pixel
may completely saturate the sensor
element at this position [21]. This causes
blooming in the camera [22], as a result of
which adjacent pixels increase their
values proportional to the magnitude of
the overload. We call such neighboring
pixels bloomed color pixels. We suspect
color values in the upper 10% of the
intensity scale in our images to be
potentially influenced by color clipping or
blooming.

• Cameras are generally much less
sensitive to blue light than to red light. In
order to provide an equal scaling on the
three color axes in the color space, we
need to rescale the pixel data separately
in the color bands. We refer to this
procedure as color balancing. We balance
the color bands by controlling the camera
aperture during the picture taking
process, using apertures for green and
blue exposures under tungsten light that
are 1/2 and 1 1/2 f-stops higher than the
aperture used for the red exposure. We
also use a total IR suppressor (Corion
FR-400) in front of our camera to
eliminate the CCD-camera's sensitivity to
infrared light.

•  The  color  pixels  also  depend  on  the 
camera  response  to  incident  light  flux. 
Due  to  gamma-correction,  the  camera 
output  is  generally  related  by  an  inverse 
power  law  to  the  incident  flux.  It  intro¬ 
duces  curvature  into  the  color  space,  dis¬ 
torting  the  linear  properties  of  the 
Dichromatic  Reflection  Model.  To 
linearize  the  color  data,  we  fit  interpolat¬ 
ing  cubic  splines  to  the  measured 
responsivity  data,  separately  in  each 
color band, as suggested by LeClerc [23].
To  model  very  bright  light  reflection,  as 
can  occur  in  highlights,  we  rescale  the 
linearization  function  and  extrapolate  it 
from  the  brightest  measurement  out¬ 
wards  by  a  square  root  curve.  We  use 
these  functions  to  generate  a  look-up 
table  relating  the  measured  intensities  in 
every  color  band  to  the  incident  flux. 


Figure  4-1 :  Color  cluster  in  the  color  cube,  with 
color  clipping 

5. Color Image Segmentation

We  will  now  describe  the  automatic  segmentation 
method  that  we  have  developed  based  on  the 
Dichromatic  Reflection  Model  and  the  camera  model 
described  above.  The  goal  of  segmentation  is  to 
identify  objects  in  an  image,  as  delimited  by  material 
boundaries.  Because  most  current  color  segmen¬ 
tation  methods  are  based  on  a  very  simple  interpreta¬ 
tion  of  color  changes  in  an  image,  they  generally  seg¬ 
ment  images  not  only  along  material  boundaries  but 
also  along  other  lines  exhibiting  color  or  intensity 
variations,  such  as  highlight  and  shadow  boundaries, 
or  object  edges  with  significant  shading  changes.  The 
Dichromatic  Reflection  Model  provides  a  more  sophis¬ 
ticated  interpretation  scheme  relating  the  physics  of 
light  reflection  to  color  changes  in  the  image  and  in 
the  color  space.  Our  segmentation  algorithm  uses  the 
Dichromatic  Reflection  Model  and  is  thus  able  to  dis¬ 
tinguish  significant  color  changes  at  material  boun¬ 
daries  from  insignificant  ones  that  are  due  to  shading 
changes  or  highlights.  The  algorithm  generates 
hypotheses  about  skewed  T's  in  the  color  cube  while 
it  analyzes  local  color  variations  in  the  image.  It  then 
uses  the  hypothesized  orientations  of  the  color 
clusters  to  decide  whether  a  color  change  is  consis¬ 
tent  with  the  local  skewed-T  model  or  whether  it  con¬ 
stitutes  a  material  change. 

5.1.  Generating  Initial  Hypotheses  about 
Color  Clusters 

We  may  use  either  a  global  (top-down)  or  a  local 
(bottom-up)  approach  to  determine  skewed  T's  in  the 
color  histogram  and  the  associated  regions  in  the  im¬ 
age.  A  global  algorithm  projects  the  entire  image  into 
the  color  space  and  then  applies  some  analysis  tech¬ 
nique  to  the  color  space  to  identify  and  distinguish 
between  the  various  clusters.  A  local  algorithm,  on 
the  other  hand,  starts  out  with  small  image  areas  and 
merges  areas  with  consistent  interpretations  into 
larger  areas,  assuming  that  local  color  variation  is  re¬ 
lated  to  only  a  few  scene  properties.  The  global  ap¬ 
proach  has  problems  when  several  objects  with  very 
similar  colors  exist  in  the  scene;  a  local  approach,  on 
the  other  hand,  has  the  problem  of  distinguishing 
camera noise from systematic color variation.

Figure  5-1  shows  a  scene  with  eight  plastic  objects 
under  white  light.  The  scene  contains  a  cup  and  two 
donuts  at  various  shades  of  red,  a  cup  and  a  donut  in 
different  yellow  colors,  a  green  cup  and  a  green 
donut,  and  a  blue  donut.  The  upper  left  quarter  dis¬ 
plays  the  original  image.  The  color  histogram  from 
the  entire  image  is  shown  in  the  upper  right  quarter. 
Note  that  the  various  clusters  overlap  significantly.  In 
such  an  image,  we  feel  that  it  is  harder  to  decide 
when  to  split  a  large  region  with  overlapping  color 
clusters than to accommodate camera noise. Therefore,
we chose to utilize a local (region growing)
scheme. 


We start by dividing the image into small,
non-overlapping windows of a given size (typically 10x10
pixels). We project the color pixels from one window
at a time into the color space and find the principal
components of the color distributions, as indicated by
the eigenvectors and eigenvalues of the covariance
matrix of the cluster [24, 25]. We sort the eigenvalues by
decreasing size. The eigenvalues and eigenvectors
determine the orientation and extent of the ellipsoid
that optimally fits the data.
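
The principal component analysis of one window can be sketched in plain Python as follows. The sketch computes the sample covariance matrix of the window's color pixels and diagonalizes it with cyclic Jacobi rotations; the choice of the Jacobi method, the function names, and the synthetic test window are assumptions made for illustration and do not describe our implementation.

```python
import math


def mean_and_covariance(pixels):
    """Sample mean and 3x3 sample covariance of a list of (R, G, B) pixels."""
    n = len(pixels)
    mean = [sum(p[k] for p in pixels) / n for k in range(3)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pixels) / (n - 1)
            for j in range(3)] for i in range(3)]
    return mean, cov


def matmul(x, y):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def principal_components(cov, sweeps=50):
    """Eigenpairs of a symmetric 3x3 matrix via cyclic Jacobi rotations.

    Returns (eigenvalue, eigenvector) pairs sorted by decreasing eigenvalue.
    """
    a = [row[:] for row in cov]
    v = [[float(i == j) for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        for p, q in ((0, 1), (0, 2), (1, 2)):
            if a[p][q] == 0.0:
                continue
            if a[p][p] == a[q][q]:
                theta = math.pi / 4
            else:
                theta = 0.5 * math.atan(2.0 * a[p][q] / (a[q][q] - a[p][p]))
            c, s = math.cos(theta), math.sin(theta)
            # Plane rotation that zeroes the off-diagonal element a[p][q].
            rot = [[float(i == j) for j in range(3)] for i in range(3)]
            rot[p][p], rot[q][q], rot[p][q], rot[q][p] = c, c, s, -s
            rot_t = [list(col) for col in zip(*rot)]
            a = matmul(matmul(rot_t, a), rot)
            v = matmul(v, rot)  # columns of v accumulate the eigenvectors
    pairs = [(a[k][k], [v[i][k] for i in range(3)]) for k in range(3)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Synthetic, noise-free "matte" window: colors vary along direction (1, 2, 2).
pixels = [(10 + t, 20 + 2 * t, 30 + 2 * t) for t in range(10)]
mean, cov = mean_and_covariance(pixels)
pairs = principal_components(cov)
```

For this synthetic window only the first eigenvalue is clearly non-zero, and the first eigenvector is parallel (up to sign) to the direction (1, 2, 2), as expected for a linear cluster.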


Figure  5-1 :  Initial  analysis  of  color  variations  for  a 
scene  with  eight  plastic  objects 

The  shape  of  the  ellipsoids  provides  information 
that  we  can  use  to  relate  the  local  color  variation  to 
physical  interpretations.  The  classification  is  based 
on  the  number  of  eigenvalues  that  are  approximately 
zero,  within  the  limit  of  pixel  value  noise  in  the  image. 
In order to classify the ellipsoids, we compare their
eigenvalues with the estimated amount of noise, $\sigma_0$.
For our camera, we use an experimentally determined
estimate of $\sigma_0 = 2.5$. We base our decision for each
eigenvalue on a $\chi^2$-test with $(n^2-1)$ parameters, where
$n$ is the window size, and we use a confidence level of
$\alpha = 0.005$. We then classify each cluster according to
how many eigenvalues are determined to be
significantly greater than zero.

• In zero-dimensional (pointlike) clusters,
all three eigenvalues of the window are
very small and the window probably lies
on a single object which is either very flat
or very dark.

• One-dimensional (linear) clusters are
clusters for which only the first
eigenvalue is significantly larger than the
estimated camera noise. Pixels in such a
window may stem from a matte object
area or from the interior of a highlight
area such that they form part of a matte
or highlight cluster. The pixels may also
come from two point clusters if a window
overlays parts of two neighboring objects.

•  Two-dimensional  (planar)  clusters  have 
large  first  and  second  eigenvalues,  and 
the  local  color  data  thus  fits  a  plane  in  the 
color  cube.  Such  clusters  occur  at  win¬ 
dows  that  either  cover  some  matte  and 
some  highlight  pixels  of  one  object  or  that 
overlay  matte  pixels  from  two  neighboring 
regions. 

• In three-dimensional (volumetric)
clusters, all three eigenvalues are large.
Such color clusters may arise in the
middle of highlights where color clipping and
blooming significantly increase the noise
in the pixel measurements. Volumetric
color clusters also occur along material
boundaries when three or more objects in
different colors share a window or when a
window overlays matte pixels of one
object and matte and highlight pixels of
another object.
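
The classification above can be sketched as a count of significant eigenvalues. The sketch makes two assumptions that are not spelled out in the text: each eigenvalue is treated as a sample variance and compared with $\sigma_0^2$ through the statistic $(N-1)\lambda/\sigma_0^2$, where $N$ is the number of pixels in the window, and the $\chi^2$ critical value is obtained from the Wilson-Hilferty approximation rather than from a table. The function names are hypothetical.

```python
from statistics import NormalDist


def chi2_critical(dof, alpha):
    """Approximate upper-alpha chi-square quantile (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    h = 2.0 / (9.0 * dof)
    return dof * (1.0 - h + z * h ** 0.5) ** 3


def cluster_dimension(eigenvalues, n_pixels, sigma0=2.5, alpha=0.005):
    """Number of eigenvalues judged significantly larger than the noise."""
    dof = n_pixels - 1
    # An eigenvalue is significant if (dof * ev / sigma0**2) exceeds the
    # critical value, i.e. if ev exceeds the limit below.
    limit = chi2_critical(dof, alpha) * sigma0 ** 2 / dof
    return sum(1 for ev in eigenvalues if ev > limit)


LABELS = ("pointlike", "linear", "planar", "volumetric")

# Hypothetical eigenvalues of a 10x10 window (100 pixels).
label = LABELS[cluster_dimension([82.5, 0.3, 0.1], n_pixels=100)]
```

Under these assumptions a 10x10 window with $\sigma_0 = 2.5$ treats eigenvalues above roughly 8.8 as significant, so the example window is labeled linear.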

The  lower  left  quarter  in  Figure  5-1  shows  the  clas¬ 
sification  of  the  color  clusters  from  the  initial  10x10 
windows.  Pointlike  clusters  are  displayed  in  black, 
linear  clusters  in  dark  grey,  planar  clusters  in  light 
grey,  and  volumetric  clusters  in  white.  The  image 
shows  that  the  classifications  relate  in  the  expected 
way  to  scene  properties.  Most  matte  object  areas  are 
covered  by  linear  windows,  while  windows  at  material 
boundaries  and  at  highlights  are  planar  or  volumetric. 

Next, we merge neighboring windows that have
similar color characteristics. The algorithm proceeds
in row-major order. For every area, it tests for every
neighbor whether the two can be merged. In order
not to merge windows across material boundaries, we
only merge neighboring windows if both of them and
the resulting larger window all have the same
classification. We use the $\chi^2$-test presented above to
classify the larger window. Accordingly, we combine
windows with pointlike clusters into larger pointlike
windows; we merge windows with linear
characteristics into larger linear windows; and we merge planar
windows into larger planar windows. We do not
merge neighboring volumetric color clusters since
there is no constraint on the resulting cluster. We
continue this process until no more areas can be
merged. The results are initial hypotheses about the
positions and orientations of pointlike, linear, and
planar clusters in color space and their respective
approximate extents in the image. The lower right
quarter in Figure 5-1 presents the results of merging
neighboring windows of the same class in the image
of the eight plastic objects.

5.2.  Exploiting  Linear  Hypotheses 

So far, the algorithm has used a bottom-up
approach to extract information about the scene from the
image. In a top-down step, the algorithm now uses
the generated hypotheses to segment the image. We
start by using the hypothesis with the largest linear
cluster. The chosen linear hypothesis provides us
with a model of what color variation to expect in the
associated image area. The mean value and the first
eigenvector describe the position and orientation of a
linear color cluster, while the second and third
eigenvalues determine the extent of the color cluster
perpendicular to the major direction of variation.
According to the Dichromatic Reflection Model, we attribute
color variation along the major axis to a physical
property of the scene, such as a changing amount of
body or surface reflection or a material boundary.


Figure  5-2:  Resolving  a  color  conflict  between  two 
matte  color  clusters 

Our model assumes that color variation
perpendicular to the first eigenvector is caused by random
noise. We thus approximate a linear color cluster by
a cylinder. The radius depends on the estimated
camera noise; we generally choose a constant
multiple of $\sigma_0$ (typically $4\sigma_0$). We exclude dark color
pixels from our color analysis because the matte
clusters merge near the dark corner of the color cube.
The cylinder is thus delimited at its dark end by a
sphere which is centered at the black corner. We
typically use a radius for the sphere of 23, which is
approximately 10% of the maximum pixel value.
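
A membership test for such a color cylinder might look as follows. The sketch uses the typical values quoted above (radius $4\sigma_0$ with $\sigma_0 = 2.5$, dark sphere of radius 23) and, as a simplifying assumption, does not bound the cylinder at its bright end. The function name and the sample cluster are hypothetical.

```python
import math


def in_color_cylinder(color, mean, axis, sigma0=2.5, k=4.0, dark_radius=23.0):
    """True if `color` lies within the cylinder around the cluster axis.

    `mean` is the cluster mean and `axis` its first eigenvector (any length).
    Colors inside the dark sphere around the black corner are rejected.
    """
    if math.sqrt(sum(c * c for c in color)) < dark_radius:
        return False  # too close to the black corner of the color cube
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    diff = [c - m for c, m in zip(color, mean)]
    along = sum(d * u for d, u in zip(diff, unit))
    # Squared perpendicular distance from the axis (Pythagoras).
    perp_sq = sum(d * d for d in diff) - along * along
    return math.sqrt(max(perp_sq, 0.0)) <= k * sigma0


# Hypothetical matte cluster.
mean = (100.0, 50.0, 50.0)
axis = (2.0, 1.0, 1.0)
```

For this cluster the color (120, 60, 60) lies on the axis and is accepted, (120, 80, 60) is about 18 units away from the axis and is rejected, and (10, 5, 5) is rejected because it falls inside the dark sphere.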

We then use the linear hypothesis to locally
resegment the image. We select a start pixel from the area
associated with the color cluster: this pixel must have
a color that is contained within the color cylinder of the
current hypothesis. We then examine its four
neighbors, testing whether their colors lie within the cylinder. If
so, we recursively examine the next fringe of neighbors
and so on. We thus grow a region of four-connected
image pixels that are consistent with the current
hypothesis.
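
The region growing step can be sketched as a breadth-first traversal of four-connected neighbors. The sketch replaces the recursion by an explicit queue, which is an implementation choice made here to avoid deep recursion; the predicate `accepts` stands for the color cylinder test of the current hypothesis, and the tiny test image is made up.

```python
from collections import deque


def grow_region(image, start, accepts):
    """Collect the four-connected pixels reachable from `start`.

    `image` is a list of rows of colors, `start` a (row, column) pair, and
    `accepts(color)` returns True if the color fits the current hypothesis.
    """
    rows, cols = len(image), len(image[0])
    region = set()
    frontier = deque([start])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) in region or not (0 <= r < rows and 0 <= c < cols):
            continue
        if not accepts(image[r][c]):
            continue
        region.add((r, c))
        frontier.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return region


# Hypothetical 3x3 image; bright red pixels stand in for one hypothesis.
RED, DARK = (200, 40, 40), (10, 10, 10)
image = [
    [RED, RED, DARK],
    [DARK, RED, DARK],
    [RED, DARK, RED],
]
region = grow_region(image, (0, 0), lambda color: color[0] > 100)
```

The two red pixels in the bottom row are not four-connected to the start pixel, so they are left for a later hypothesis.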

Since  the  matte  clusters  converge  towards  the  dark 
corner  of  the  color  cube,  there  exists  a  potential  con¬ 
flict  between  neighboring  matte  areas.  Our  heuristic 
of  excluding  very  dark  pixels  from  the  analysis 
eliminates  the  most  difficult  cases;  still,  the  cylinders 
of  neighboring  clusters  may  intersect  beyond  the 
selected  dark  threshold.  This  depends  on  the  cylinder 
radius  and  on  the  angle  between  the  two  cylinders. 
Thus,  neighboring  objects  with  similar  colors  will  have 
conflicts  even  at  fairly  bright  colors.  We  assign  pixels 
with  a  color  conflict  to  the  cluster  with  the  closest  axis, 
as  shown  in  Figure  5-2. 
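
Resolving a color conflict by the closest axis can be sketched as follows. Each cluster is described by its mean and first eigenvector; the dictionary layout, the function names, and the two sample clusters are assumptions made for illustration.

```python
import math


def distance_to_axis(color, mean, axis):
    """Perpendicular distance from `color` to the line mean + t * axis."""
    norm = math.sqrt(sum(a * a for a in axis))
    diff = [c - m for c, m in zip(color, mean)]
    along = sum(d * a for d, a in zip(diff, axis)) / norm
    return math.sqrt(max(sum(d * d for d in diff) - along * along, 0.0))


def resolve_conflict(color, clusters):
    """Return the name of the cluster whose axis is nearest to `color`."""
    return min(clusters,
               key=lambda name: distance_to_axis(color, *clusters[name]))


# Two hypothetical matte clusters with similar colors: name -> (mean, axis).
clusters = {
    "red": ((120.0, 30.0, 30.0), (4.0, 1.0, 1.0)),
    "orange": ((120.0, 60.0, 30.0), (4.0, 2.0, 1.0)),
}
```

Here the color (80, 25, 20) is assigned to the red cluster, while (80, 40, 20), which lies exactly on the orange axis, is assigned to the orange cluster.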

Since  there  is  generally  more  than  one  object  in  the 
scene,  our  algorithm  iterates  the  above  steps  for  each 
image  area  with  a  large  linear  cluster,  selecting  the 
areas  by  decreasing  size.  It  stops  when  the  next 
selected  area  is  too  small  (typically  less  than  500 
pixels). 


Figure  5-3:  Segmentation  steps  of  the  scene  with 
eight  plastic  objects 

In principle, the resulting "linear" regions may be
related in any of several different ways to the physical
processes in the scene. As discussed in Section 5.1,
a linear color cluster may be a matte cluster or a
highlight cluster or even a combination of two clusters
across a material boundary. However, we expect that
linear color clusters from highlights and across
material boundaries are generally much smaller than
matte clusters. Since we use only hypotheses
connected with large linear clusters, we assume in the
following that all of the linear regions correspond to
matte linear clusters.

The  upper  right  quarter  of  Figure  5-3  shows  the 
results  of  selecting  and  applying  linear  hypotheses  to 
the  image  with  the  eight  plastic  objects.  The  region 
boundaries  outline  the  matte  object  parts  in  the  scene, 
with  the  material  boundaries  being  well  observed. 
The  highlight  areas  on  the  objects  have  not  yet  been 
analyzed.  This  sometimes  divides  the  matte  pixels  on 
an  object  into  several  matte  areas,  as  shown  on  the 
middle  cup  and  on  the  lower  right  donut. 

5.3.  Generating  Planar  Hypotheses 

We will now extend the above generated linear
hypotheses into planar hypotheses that describe
dichromatic planes and skewed T's. For every linear
region, we examine all its neighbors to determine a
prospective highlight candidate. In order to decide
whether a neighboring region comes from another
object (across a material boundary) or whether it is a
highlight on the same object, we test whether both
linear clusters point in positive directions and whether
the two clusters form a skewed T and meet in the
upper 50% of the matte cluster. A similar idea was
used by Gershon [8] to identify highlight segments in a
postprocessing step after an initial, traditional
segmentation was performed. He tested whether the two
clusters are nearly parallel or whether they intersect,
suspecting that parallel clusters are neighboring matte
clusters. In our method, the physical model is used
during the segmentation process rather than as a
post-processing step.

Each  candidate  cluster  that  we  test  is  a  group  of 
pixels  on  the  border  of  a  linear  region.  If  the  cluster  is 
in  fact  a  highlight,  the  first  eigenvector  of  the  cluster's 
color  distribution  will  generally  be  in  the  direction  of 
the  highlight  color.  We  start  by  testing  whether  the 
first  eigenvector  of  a  highlight  candidate  describes  a 
positive  color  change,  i.e:  dR  >  0  and  dG  >  ()  and 
dB  >  0.  If  this  is  not  the  case,  the  window  of  the 
highlight  candidate  probably  overlays  a  material 
boundary  and  thus  partially  covers  pixels  from  the  cur¬ 
rent  matte  region  and  partially  covers  pixels  from  a 
neighboring  matte  region. 
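
As an illustration, the positive-direction test can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used for the results reported here: the function names, the use of power iteration to obtain the first eigenvector, the start vector and the tolerance `tol` are all hypothetical choices made for the example.

```python
def first_eigenvector(colors, iterations=200):
    """Approximate the dominant eigenvector of the color covariance matrix."""
    n = len(colors)
    mean = [sum(c[k] for c in colors) / n for k in range(3)]
    # 3x3 covariance matrix of the (R, G, B) samples in the window.
    cov = [[sum((c[i] - mean[i]) * (c[j] - mean[j]) for c in colors) / n
            for j in range(3)] for i in range(3)]
    # Power iteration. The start vector is an arbitrary (assumed) choice; it
    # fails only if it is exactly orthogonal to the dominant eigenvector.
    v = [1.0, 0.5, 0.25]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        length = sum(x * x for x in w) ** 0.5
        if length == 0.0:
            return [0.0, 0.0, 0.0]  # point-like cluster: no direction
        v = [x / length for x in w]
    return v


def is_positive_color_change(direction, tol=1e-6):
    """True if dR, dG and dB are all positive after orienting the vector."""
    # The sign of an eigenvector is arbitrary, so orient it toward brighter colors.
    if sum(direction) < 0:
        direction = [-d for d in direction]
    return all(d > tol for d in direction)
```

A window that straddles a material boundary between, say, a red and a green object yields a direction with components of mixed sign, so the test rejects it.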

Next,  we  exploit  the  T-shape  of  the  dichromatic 
cluster  and  the  50%-heuristic.  For  this  purpose,  we 
determine  the  brightest  matte  point  in  the  color 
cluster.  In  order  not  to  select  a  highlight  point  that  fell 
into  the  matte  cylinder,  as  shown  in  Figure  5-4,  we 
exploit  the  observation  that  highlight  clusters  always 
grow  inwards  into  the  color  cube,  due  to  the  additive 
nature  of  body  and  surface  reflection.  We  thus 
choose  the  brightest  matte  point  only  from  pixels  with 
color values that are closer to the walls of the cube
than the matte line is. Next, we determine the intersection
points of the two clusters. We use their color
means and first eigenvectors to describe a matte and
a prospective highlight line and determine the two
points on the two lines that are closest to one another.
If the distance between them is larger than a multiple
of the estimated camera noise (typically $4\sigma_0$), we
decide that the clusters do not meet in a skewed T
and the neighboring area is discarded. Similarly, if
the  clusters  intersect  in  the  lower  50%  of  the  matte 
cluster,  we  suspect  that  we  are  looking  at  two  matte 
clusters  from  neighboring  objects,  and  we  discard  this 
neighbor. 
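
The closest-point computation and the two rejection tests can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the extent of the matte cluster is passed in as two assumed line parameters, `s_dark` and `s_bright`, giving the positions of the darkest and the brightest matte point along the matte line.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def closest_points(p1, d1, p2, d2):
    """Closest points on the lines p1 + s*d1 and p2 + t*d2, plus parameter s."""
    w = [a - b for a, b in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # parallel lines have no unique closest pair
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1 = [p + s * x for p, x in zip(p1, d1)]
    q2 = [p + t * x for p, x in zip(p2, d2)]
    return q1, q2, s


def forms_skewed_t(matte_mean, matte_dir, hl_mean, hl_dir,
                   s_dark, s_bright, sigma, k=4.0):
    """Apply the distance test and the 50%-heuristic to a highlight candidate."""
    result = closest_points(matte_mean, matte_dir, hl_mean, hl_dir)
    if result is None:
        return False  # parallel clusters: probably two matte clusters
    q1, q2, s = result
    gap = sum((a - b) ** 2 for a, b in zip(q1, q2)) ** 0.5
    if gap > k * sigma:
        return False  # the two lines do not meet within the noise bound
    # The meeting point must lie in the upper half of the matte cluster.
    return s >= 0.5 * (s_dark + s_bright)
```

The default `k=4.0` mirrors the typical noise multiple quoted above; `sigma` stands for the estimated camera noise.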


Figure  5-4:  Finding  the  brightest  matte  pixel 

There  may  be  several  highlight  candidates  because 
the  highlight  on  the  object  may  consist  of  a  series  of 
small  windows  that  could  not  be  merged,  due  to  color 
clipping  and  blooming  in  the  middle  of  the  highlight. 
In  order  to  select  a  good  representative  of  the  entire 
highlight,  we  average  the  intersection  points  of  all 
highlight  candidates,  weighted  by  the  number  of  pixels 
in  the  various  regions.  We  then  select  the  highlight 
region  whose  intersection  point  is  closest  to  the 
average. 
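
The selection of a representative highlight region can be sketched as below. The data layout (a list of pairs of intersection point and pixel count) and the function name are assumptions made for this example.

```python
def select_representative(candidates):
    """Pick the candidate whose intersection point is closest to the
    pixel-count-weighted average of all intersection points.

    candidates: list of (intersection_point, pixel_count) pairs, where
    intersection_point is an (R, G, B) triple.
    Returns (index_of_selected_candidate, weighted_average_point).
    """
    total = sum(count for _, count in candidates)
    average = [sum(point[k] * count for point, count in candidates) / total
               for k in range(3)]

    def squared_distance(point):
        return sum((a - b) ** 2 for a, b in zip(point, average))

    best = min(range(len(candidates)),
               key=lambda i: squared_distance(candidates[i][0]))
    return best, average
```

Weighting by pixel count lets large highlight windows dominate the average, so a small outlying window cannot pull the representative away from the bulk of the highlight.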

Since  we  assume  in  the  Dichromatic  Reflection 
Model  that  highlights  have  the  same  color  as  the  il¬ 
lumination,  all  highlight  clusters  are  parallel  to  one 
another  and  to  the  illumination  color  vector.  As  a 
consequence,  all  dichromatic  planes  intersect  along 
one  line,  which  is  the  illumination  vector.  We  use  this 
constraint  to  further  reduce  the  error  estimating  the 
orientations  of  the  highlight  clusters,  combining  all 
highlight  hypotheses  into  a  single  hypothesis  about 
the color of the illumination. Observing that any imprecision
of the highlight vector orientation within the
dichromatic plane is irrelevant, we base our computation
on the normals $n_i$ of the dichromatic planes of all
objects. We then determine the intersection vector $i_{ij}$
between each pair of planes:


$$i_{ij} = n_i \times n_j \qquad (5)$$


Each intersection vector $i_{ij}$ provides us with a vector
describing a hypothesized illumination color. The
strength of the hypothesis depends on the angle between
the two dichromatic planes under consideration.
If the planes intersect nearly perpendicularly, the
intersection  line  will  not  vary  much  with  small  amounts 
of  error  in  estimating  the  orientations  of  the  planes.  If, 
on  the  other  hand,  the  planes  are  nearly  parallel, 
small  amounts  of  noise  may  arbitrarily  influence 
whether  and  where  the  planes  intersect.  We  use  a 
weighted  average  of  all  intersection  vectors  to 
generate  a  hypothesis  on  the  illumination  color,  in 
which  every  intersection  vector  contributes  according 
to  the  dihedral  angle  enclosed  by  the  respective 
dichromatic  planes. 
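
A sketch of this combination step is given below. The text above does not fix the exact weighting function, so the sketch assumes that each intersection vector is weighted by the sine of the dihedral angle, which for unit normals is simply the length of the cross product in Equation (5). The function names are hypothetical.

```python
from itertools import combinations


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def unit(v):
    length = sum(x * x for x in v) ** 0.5
    return [x / length for x in v]


def estimate_illumination(plane_normals):
    """Combine all pairwise plane intersections into one illumination vector."""
    normals = [unit(n) for n in plane_normals]
    total = [0.0, 0.0, 0.0]
    for n_i, n_j in combinations(normals, 2):
        i_ij = cross(n_i, n_j)  # Equation (5); length = sin(dihedral angle)
        if sum(x * x for x in i_ij) < 1e-18:
            continue  # (nearly) parallel planes carry no usable evidence
        # A line has no preferred sign: orient it toward positive colors.
        if sum(i_ij) < 0:
            i_ij = [-x for x in i_ij]
        # Adding the unnormalized vector weights it by sin(angle) (assumed).
        total = [t + x for t, x in zip(total, i_ij)]
    if all(t == 0.0 for t in total):
        return None  # fewer than two non-parallel planes
    return unit(total)
```

With this weighting, nearly parallel planes contribute almost nothing, while perpendicular planes contribute with full weight.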

5.4.  Exploiting  Planar  Hypotheses 

The  Dichromatic  Reflection  Model  states  that  color 
variations  on  a  single  object  lie  in  a  plane  in  the  color 
space.  We  therefore  use  the  planar  hypotheses  to 
resegment  the  image,  expecting  that  the  resulting  im¬ 
age  areas  cover  entire  objects  and  will  not  be  inter¬ 
rupted  by  highlight  boundaries.  We  select  one 
hypothesis  at  a  time  and  apply  it  to  the  image, 
proceeding  iteratively  until  no  more  unprocessed 
hypotheses  with  sufficient  support  from  the  image  ex¬ 
ist. 

The  chosen  matte  cluster  and  the  illumination 
hypothesis  determine  a  planar  hypothesis  predicting 
all  significant  color  variation  on  the  associated  object. 
We use the cross product of the illumination vector
and the first eigenvector of the matte cluster to determine
the normal to this dichromatic plane. We use
the color mean of the matte cluster to position the
plane in the color cube, and we extend the plane into
a slice of fixed thickness to account for camera noise.
A typical choice for the width of the slice is $4\sigma_0$ in the
positive  and  negative  direction  of  the  normal. 
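
The membership test for such a planar slice can be sketched as follows. The function name and argument order are hypothetical; `sigma` stands for the estimated camera noise and `k` for the multiple used as the slice half-width.

```python
def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def in_planar_slice(color, matte_mean, matte_dir, illumination, sigma, k=4.0):
    """True if a color lies within k*sigma of the hypothesized dichromatic plane."""
    # Normal of the plane spanned by the illumination and matte directions.
    normal = cross(illumination, matte_dir)
    length = sum(x * x for x in normal) ** 0.5
    if length == 0.0:
        raise ValueError("matte and illumination vectors are collinear")
    # Signed distance of the color from the plane through the matte mean.
    offset = [c - m for c, m in zip(color, matte_mean)]
    distance = sum(n * o for n, o in zip(normal, offset)) / length
    return abs(distance) <= k * sigma
```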

The  algorithm  uses  the  chosen  planar  hypotheses 
to  locally  resegment  the  image.  In  principle,  we  start 
from  the  selected  matte  region  and  expand  it,  recur¬ 
sively  including  pixels  at  the  region  boundaries  which 
are  consistent  with  the  planar  hypothesis.  However, 
the  algorithm  must  be  augmented  with  special  provi¬ 
sions  to  handle  coplanar  color  clusters  from  neigh¬ 
boring  objects.  If  the  illumination  vector  lies  in  the 
plane  spanned  by  the  matte  vectors  of  two  neigh¬ 
boring  objects,  the  color  clusters  of  the  objects  lie  in 
the  same  dichromatic  plane  and  thus  cannot  be  distin¬ 
guished  by  a  mere  planar  region  growing  approach. 
The  resulting  segmentation  would  generally  be  quite 
counterintuitive  since  the  matte  object  colors  may  be 
very  different  and  even  complementary.  To  avoid 
such  segmentation  errors,  we  exploit  the  previously 
gathered  knowledge  about  existing  matte  color 
clusters. When the planar region growing process encounters
pixels from a previously grown matte region other
than the starting region, it only continues growing if
the pixel lies within the matte color cylinder of the
starting region. We thus apply the unrestricted planar
growing  criterion  only  to  pixels  that  have  not  been 
previously  recognized  as  matte  pixels,  while  we  fall 
back  to  the  linear  region  growing  method  when  matte 
pixels  are  concerned.  This  reflects  the  observation 
that  if  several  matte  areas  exist  in  an  object  area, 
separated  by  highlight  regions,  all  such  matte  areas 
form  a  single  matte  cluster.  We  also  apply  the 
proximity  heuristic  described  in  the  previous  section  to 
resolve  ambiguities  for  color  pixels  at  the  intersection 
of  dichromatic  planes. 
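
The two-tier growing rule described above can be sketched as a breadth-first region growing procedure. This is a simplified illustration, not our implementation: it assumes 4-connectivity, takes the planar and cylindrical membership tests as caller-supplied predicates (`in_slice` and `in_cylinder`, both hypothetical names), and omits the proximity heuristic.

```python
from collections import deque


def grow_planar_region(image, matte_labels, start_label, in_slice, in_cylinder):
    """Grow the matte region `start_label` under a planar hypothesis.

    image:        2D list of colors.
    matte_labels: 2D list holding the matte region label of each pixel,
                  or None for pixels not yet recognized as matte.
    in_slice:     predicate, True if a color lies in the planar slice.
    in_cylinder:  predicate, True if a color lies in the matte color
                  cylinder of the starting region.
    Returns the set of (row, column) positions in the grown region.
    """
    height, width = len(image), len(image[0])
    region = {(y, x) for y in range(height) for x in range(width)
              if matte_labels[y][x] == start_label}
    queue = deque(region)
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if not (0 <= ny < height and 0 <= nx < width):
                continue
            if (ny, nx) in region:
                continue
            color = image[ny][nx]
            if matte_labels[ny][nx] is None:
                accepted = in_slice(color)      # unrestricted planar criterion
            else:
                accepted = in_cylinder(color)   # matte pixel: linear criterion
            if accepted:
                region.add((ny, nx))
                queue.append((ny, nx))
    return region
```

Because previously labeled matte pixels must pass the stricter cylinder test, a coplanar matte cluster from a neighboring object is not absorbed, while a second matte area of the same object behind a highlight is.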

The  lower  left  quarter  of  Figure  5-3  displays  the 
results  of  segmenting  the  scene  using  the  generated 
planar  hypotheses.  In  comparison  to  the  linear  seg¬ 
mentation  in  the  upper  right  quarter,  the  segmented 
image  areas  have  grown  into  the  highlight  areas.  As 
a  result,  the  two  matte  image  areas  on  the  middle  cup 
that  were  previously  separated  by  a  highlight  are  now 
united.  The  same  occurred  on  the  lower  right  donut. 
Due  to  camera  limitations,  not  all  pixels  in  the  centers 
of  the  highlights  are  yet  integrated  into  the  object 
areas.  This  will  be  discussed  and  remedied  in  the 
next  section. 

5.5.  Accounting  for  Camera  Limitations 

Unfortunately,  real  images  generally  do  not  comply 
with  the  Dichromatic  Reflection  Model.  Color  clipping 
and  blooming  may  distort  the  color  of  pixels  in  and 
around  highlights  significantly.  As  a  result,  the  color 
pixels  in  the  centers  of  highlights  will  generally  not  fall 
into  the  planar  slice  defined  for  the  dichromatic  plane 
of  an  object  area  and  the  planar  segmentation  will 
exclude such pixels, as can be observed in the lower
left quarter of Figure 5-3.

Since  color  information  is  so  unreliable  for  these 
pixels,  we  do  not  use  it.  Instead,  we  use  a  geometric 
heuristic to include them into the region. As mentioned
above, pixels with distorted colors generally occur
in the middle of the highlight areas. Thus, starting
from highlight pixels, we expand the planar regions
into areas that are next to highlight pixels and contain
very bright pixels (i.e., brighter than the intersection
point between the matte and highlight cluster).

The lower right quarter of Figure 5-3 displays the
results of segmenting the scene using the generated
planar hypotheses together with this geometric
heuristic. Nearly all pixels in the highlight
centers have now been integrated into the segmented
regions, and the image segments correspond very
well to the objects in the scene. Very few pixels on
the highlights have been excluded, due to our heuristic
of only integrating very bright pixels. Any image
processing method for filling holes in image segments
should be able to include these pixels. The lowest of
the three donuts in the upper left exhibits two interesting
properties. First, a small area at its upper left
edge has wrongly been assigned to the donut above
it. We can see in the original image that this area is
covered by a shadow, resulting in very dark pixel
values.  Since  the  color  clusters  of  the  two  donuts 
(yellow  and  red)  are  already  merged  at  these  color 
values,  the  assignment  was  based  on  the  distance 
from  the  two  cylinder  axes,  which  happened  to  be 
smaller  for  the  upper  donut.  Second,  there  is  a  small 
area  in  the  lower  right  part  of  the  donut  that  was  not 
included  into  the  large  image  segment  covering  the 
donut.  The  color  of  these  pixels  has  been  significantly 
altered  by  inter-reflection  from  the  middle  cup  which 
reflected  a  part  of  its  body  color  (orange)  onto  the 
(yellow)  donut,  thus  influencing  the  reflected  colors  of 
the  donut.  Inter-reflection  may  also  be  a  factor  in  the 
shadowy  part  in  the  upper  left  of  the  donut.  This 
demonstrates  the  sensitivity  of  our  algorithm  to  the 
reflection processes in the scene. Since inter-reflection
is not yet a part of our reflection model, our
current  algorithm  cannot  identify  it  in  the  image  and 
process  it  correctly. 

6.  Removing  Highlights 

As  one  application  of  our  above  method  to  analyze 
and  segment  color  images,  we  can  now  use  the 
gathered  information  about  the  color  clusters  to  split 
every  color  pixel  into  its  two  reflection  components. 
We  thus  generate  two  intrinsic  images  of  the  scene, 
one  showing  the  objects  as  if  they  were  completely 
matte,  and  the  other  showing  only  the  highlights. 

We have previously reported a method to detect
and remove highlights from hand-segmented images7.
That method projected the pixels from a selected
image  area  into  the  color  space  and  fitted  a  skewed  T 
to  the  entire  color  cluster,  thus  determining  the  body 
and  surface  reflection  vectors,  Cb  and  Cs,  of  the  area. 
Since  our  new  segmentation  algorithm  already 
provides  this  information  as  a  result  of  its  analysis  of 
local  color  variations,  we  can  now  skip  this  step. 
However,  due  to  possible  estimation  errors  in  the  seg¬ 
mentation  process  and  due  to  camera  problems  such 
as  blooming  and  chromatic  aberration,  the  vectors 
may  not  yet  fit  perfectly.  In  order  to  obtain  a  more 
precise  fit  to  the  data,  we  retest  every  pixel  in  the 
segmented  area  and  label  it  as  a  matte  or  highlight 
pixel,  depending  on  whether  it  is  closer  to  the  matte 
line  or  to  the  highlight  line,  which  is  the  illumination 
vector  starting  from  the  intersection  point  of  the 
skewed  T.  We  then  refit  the  matte  and  highlight  line  to 
the  matte  and  highlight  pixels  by  determining  the  first 
eigenvectors  and  the  color  means  of  the  clusters. 
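
The relabeling step can be sketched as below. The sketch treats the highlight cluster as an infinite line through the intersection point of the skewed T rather than as a half-line, which is a simplification; the function names are hypothetical. The subsequent refit would recompute the color mean and first eigenvector of each labeled group.

```python
def distance_to_line(point, origin, direction):
    """Euclidean distance from a color to the line origin + t * direction."""
    rel = [p - o for p, o in zip(point, origin)]
    scale = sum(r * d for r, d in zip(rel, direction)) / sum(
        d * d for d in direction)
    residual = [r - scale * d for r, d in zip(rel, direction)]
    return sum(x * x for x in residual) ** 0.5


def label_pixels(colors, matte_origin, matte_dir, t_point, illumination):
    """Label each color 'matte' or 'highlight' by its nearer cluster line.

    t_point is the intersection point of the skewed T, from which the
    highlight line starts in the direction of the illumination vector.
    """
    labels = []
    for color in colors:
        d_matte = distance_to_line(color, matte_origin, matte_dir)
        d_highlight = distance_to_line(color, t_point, illumination)
        labels.append("matte" if d_matte <= d_highlight else "highlight")
    return labels
```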

The algorithm uses the reflection vectors Cb and Cs
that are defined for every segmented region to split
every pixel in each region into its two reflection components.
Cb, Cs, and their cross product, Cb x Cs,
define a new (not necessarily orthogonal) coordinate
system in the color cube. This coordinate system
describes every color in the cube in terms of the
amounts of body reflection, surface reflection and
noise $\epsilon$, as given by the color distance from the
dichromatic plane (see Figure 6-1). There exists an
affine transformation, and thus a linear transformation
matrix $T$, which transforms any color vector
$c = [R, G, B]^T$ from the initial coordinate system into a
vector $d = [m_b, m_s, \epsilon]^T$ in the new coordinate system. After
computing $T$ from Cb and Cs, we can thus transform
every color pixel in the image area into its constituent
body and surface reflection components, mb and ms.
By  selecting  the  mb-component  of  every  transformed 
pixel,  we  generate  the  body  reflection  image  of  the 
image  area.  By  selecting  the  ms-component,  we 
generate  the  corresponding  surface  reflection  image. 
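
For illustration, the decomposition of a single color can be written without forming the matrix explicitly, by solving c = mb Cb + ms Cs + e (Cb x Cs) with Cramer's rule. The function names are hypothetical, and in this sketch the noise coordinate is expressed in units of the length of Cb x Cs rather than as a normalized distance.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def decompose(color, c_b, c_s):
    """Return (m_b, m_s, eps) with color = m_b*c_b + m_s*c_s + eps*(c_b x c_s)."""
    normal = cross(c_b, c_s)
    # Determinant of the basis [c_b, c_s, normal] as a scalar triple product.
    det = dot(c_b, cross(c_s, normal))
    if abs(det) < 1e-12:
        raise ValueError("body and surface reflection vectors are collinear")
    # Cramer's rule: replace one basis vector at a time by the color.
    m_b = dot(color, cross(c_s, normal)) / det
    m_s = dot(c_b, cross(color, normal)) / det
    eps = dot(c_b, cross(c_s, color)) / det
    return m_b, m_s, eps
```

Applying `decompose` to every pixel of a region and keeping only the first or the second coordinate gives the body or the surface reflection image of that region. The collinearity check reflects the requirement that the two reflection vectors enclose a sufficiently large angle.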


Figure 6-1: Decomposing a color pixel into its
constituent reflection components

We cannot apply this method to bloomed and
clipped pixels because such pixels generally do not
satisfy the conditions of the Dichromatic Reflection
Model: since their color is altered by the sensing
process in the camera, they generally do not lie on the
skewed T or even in the dichromatic plane. In order
to restore the physically correct color of such pixels,
we exploit the observation that in many cases, clipping
and blooming occur only in one or two color
bands. The pixels may thus have correct data in the
other color bands. We assume that the smallest of
the three values of a color pixel stems from a color
band without clipping and blooming. We then replace
the clipped or bloomed pixel by a pixel on the matte or
highlight line that has the same value in the undistorted
band.

The upper quarters in Figure 6-2 display the resulting
intrinsic images of the scene with the eight plastic
objects. The images demonstrate that we are able to
determine the body and surface reflection components
of the various objects reasonably well. The
orientations of the corresponding reflection vectors
are shown in Table 9-1. On visual inspection,
the body reflection image seems to
generally provide a smooth shading component
across the highlight area. We expect that this image
may therefore be a useful tool to determine object
shapes from shading information. We will investigate
this application in the future. There exist thin dark
rings in the body reflection image around the highlight
on the lower right donut. This error is due to
chromatic aberration in the camera, which currently
limits the performance of our algorithm.

The surface reflection image shows the
highlight components on the objects very well. All highlights
have been detected, and beyond that, the image
provides precise quantitative information on the varying
amount of surface reflection. Notice the gradient
in the surface reflection component on the rightmost
cup and the small amount of gloss that was detected
on the handle of the middle cup. A careful inspection
of the surface reflection image reveals that the surface
reflection component also increases at the
material boundaries. This effect is related to aliasing
occurring at pixels that integrate light from two neighboring
objects. The colors of such pixels are a linear
combination of two matte object colors. Depending
on the orientations of the dichromatic planes, they
may be included in either of the two object areas in
the image, or they may be left unassigned, as at the
edge between the middle and the rightmost cup. If
they are assigned to one of the object areas, they
generally do not lie close to the matte line, thus resulting
in a higher surface reflection component.


Figure  6-2:  Intrinsic  reflection  images  of  the  scene 
with  eight  plastic  objects 

7.  Results  and  Discussion 

Figure  7-1  shows  the  results  of  applying  our  seg¬ 
mentation  algorithm  and  the  reflection  analysis  to  a 
scene  of  an  orange,  a  green  and  a  yellow  cup  under 
white  light.  These  same  images  were  analyzed  in  our 
earlier work, using hand-segmentation7. The segmented
image in the upper right quarter demonstrates
that our method is capable of correctly identifying the
object areas. Due to its modeling of matte and highlight
color variations, our program ignores color
changes  along  the  highlight  boundaries.  The  lower 
quarters  in  the  figure  display  the  intrinsic  body  and 
surface  reflection  images.  The  orientations  of  the 
matte  and  highlight  vectors  are  given  in  Table  9-2. 
Table  9-3  shows  the  vector  orientations  of  the  cups 
under  yellow  light.  The  intrinsic  images  and  the 
reflection vectors are comparable to or better than the
ones that we obtained earlier by fitting skewed Ts to
the entire histogram of hand-segmented image areas.

Figure 7-1: Analysis results for three plastic cups
under white light

Our  examples  show  that  our  algorithm  analyzes 
and  segments  the  given  images  quite  well.  We  will 
now  discuss  the  conceptual  and  implementational 
limitations of our approach. First, our current implementation
depends on the window size used for
determining the initial local color variation in the image.
If the windows are too large, many of them
overlap several objects, and the clusters are
volumetric. If the windows are too small, color variation
on a relatively flat or dark object may be overlooked,
resulting in pointlike clusters. Second, our
cylindrical model for the linear hypotheses does not
account for secondary effects on pixel values, such as
color clipping, blooming, chromatic aberration, aliasing
and inter-reflection between objects. Because
such effects influence the shape of the color cluster, a
non-circular model of the cross-section of linear
clusters  may  be  needed.  We  deal  with  this  problem 
by  choosing  a  fairly  large  cylinder  radius.  It  may  be 
desirable  to  adaptively  fit  a  tighter  model  to  the  linear 
cluster.  Along  the  same  lines,  the  hypothesized  orien¬ 
tation  of  the  linear  cluster  may  be  inexact  and  could 
be  improved  by  an  adaptive  method  that  refits  the 
cylinder  until  it  optimally  fits  a  given  cluster.  We  are 
currently  investigating  these  issues. 

The conceptual limitations of our approach are related
to the basic principles of our model. Since we
attribute any linear color variation to the changing illumination
geometry of a single material, we are unable
to find material boundaries between objects with
collinear matte clusters. We will need a geometrical
analysis, linking intensity gradients to object shape, to
distinguish between such objects. The same will be
needed to analyze dark image areas, which are currently
excluded because their color information is too
unreliable.

The  model  also  makes  simplifying  assumptions 
about  the  illumination  conditions  and  the  materials  in 
the  scene.  A  color  cluster  from  a  single  object  in  an 
unconstrained  scene  will  generally  not  be  a  skewed  T 
composed  of  linear  subclusters  because  the  illumina¬ 
tion  color  may  vary  on  different  parts  on  the  object 
surface,  and  the  reflection  properties  of  the  object 
may  also  change,  due  to  pigment  variations  in  the 
material  body.  The  necessary  extensions  to  the 
model  will  be  the  subject  of  future  work. 

Furthermore,  our  method  to  split  color  pixels  into 
their  reflection  components  (but  not  the  segmentation) 
relies  on  a  characteristic  color  change  between  the 
matte  object  color  and  the  highlight  color.  There 
needs  to  be  a  certain  angle  between  the  orientations 
of  the  body  and  surface  reflection  vectors  of  an  object. 
How  big  the  angle  needs  to  be  depends  on  the  es¬ 
timated  camera  noise  (as  related  to  the  width  of  the 
matte  cylinder).  If  the  matte  and  highlight  clusters  are 
approximately  collinear,  we  cannot  separate  the 
reflection  components.  Similarly,  we  have  problems 
when  one  of  the  two  linear  clusters  does  not  exist  or  is 
very small for an object. The matte cluster is missing
if the viewed object is very dark, or if the scene is
illuminated with a narrow-band illuminant that does
not overlap with the wavelengths at which the material
reflects light. Matte clusters also do not exist for
metallic objects. On the other hand, the highlight
cluster may be missing if an object does not reflect a
highlight into the camera, due to its position in the
scene and the illumination geometry. As a third case,
we need to consider objects with very rough surfaces
such that every pixel in the image area has both a
significant body and surface reflection component.
The color cluster may then fill out the entire
dichromatic plane. A common special case of this are
so-called "matte" or "Lambertian" materials - as opposed
to glossy materials - which reflect a constant
amount of surface reflection in every direction and
thus  never  exhibit  a  highlight  in  the  common  sense  of 
the  word.  The  corresponding  color  clusters  are  linear 
clusters,  translated  from  the  origin  of  the  color  space 
according  to  the  constant  surface  reflection  com¬ 
ponent.  Our  current  method  is  not  capable  of  distin¬ 
guishing  between  all  these  cases  that  result  in  a 
single,  linear  cluster.  In  combination  with  exploiting 
previously  determined  scene  properties,  such  as  the 
illumination  color,  we  will  need  to  analyze  the  intensity 
gradients  along  the  linear  axes  and  relate  them  to  the 
properties  of  ms  and  mb,  as  described  in  a  geometri¬ 
cal  model  of  light  reflection. 

8.  Conclusions 

In  this  paper,  we  have  demonstrated  that  it  is  pos¬ 
sible  to  analyze  and  segment  real  color  images  by 
using  a  physics-based  color  reflection  model.  Our 
model  accounts  for  highlight  reflection  and  matte 
shading,  as  well  as  for  some  characteristics  of 
cameras.  By  developing  a  physical  description  of 
color  variation  in  color  images,  we  have  developed  a 
method  to  automatically  segment  an  image  while 
generating  hypotheses  about  the  scene.  We  then  use 
the  knowledge  we  have  gained  to  separate  highlight 
reflection  from  matte  object  reflection.  The  resulting 
intrinsic  reflection  images  have  a  simpler  relationship 
to  the  illumination  geometry  than  the  original  image 
and may thus improve the results of many other computer
vision algorithms, such as motion analysis,
stereo vision, and shape from shading or highlights26,9,4,27.
Since the surface reflection component of
dielectric materials generally has the same color as
the illumination, we can also determine the illumination
color from the intrinsic surface reflection image,
information which is needed by color constancy algorithms28,29,30,31.

The  key  points  leading  to  the  success  of  this  work 
are  our  modeling  of  highlights  as  a  linear  combination 
of  both  body  and  surface  reflection  and  our  modeling 
of the camera properties. With few exceptions28,8,30,5,
previous work on image segmentation and
highlight detection has assumed that the color of highlight
pixels is completely unrelated to the object color.
This  assumption  would  result  in  two  unconnected 
clusters  in  the  color  space:  one  line  or  ellipsoid 
representing  the  object  color  and  one  point  or  sphere 
representing  the  highlight  color.  Our  model  and  our 
color  histograms  demonstrate  that,  in  real  scenes,  a 
transition  area  exists  on  the  objects  from  purely  matte 
areas  to  the  spot  that  is  generally  considered  to  be 
the  highlight.  This  transition  area  determines  the 
characteristic shapes of the color clusters, which is the
information that we use to distinguish highlight boundaries
from material boundaries and to detect and
remove  highlights.  This  view  of  highlights  should 
open  the  way  for  quantitative  shape-from-gloss 
analysis,  as  opposed  to  binary  methods  based  on 
thresholding  intensity. 


By  modeling  the  camera  properties,  we  are  able  to 
obtain  high  quality  color  images  (through  color  balanc¬ 
ing  and  spectral  linearization)  in  which  most  pixels 
maintain  the  linear  properties  of  light  reflection,  as 
described  in  the  Dichromatic  Reflection  Model.  We 
can  also  detect  most  distorted  color  pixels  in  an  image 
and  thus  generate  an  intrinsic  error  image  which  then 
guides  our  algorithm  to  separate  only  undistorted 
color  pixels  into  their  reflection  components.  We  ex¬ 
pect  that  the  intrinsic  error  image  will  be  similarly  use¬ 
ful  in  guiding  other  computer  vision  algorithms,  such 
as  shape  from  shading.  It  may  also  enable  us  to 
automatically  control  the  camera  aperture  so  that  we 
can  obtain  color  images  with  minimal  clipping  and 
blooming. 

Our  hypothesis-based  approach  towards  image 
segmentation may provide a new paradigm for low-level
image understanding. Our method gains its
strength from using an intrinsic model of physical
processes that occur in the scene. The results are
intrinsic images and hypotheses which are closely related
in their interpretation to the intrinsic model, being
instantiations of concepts formulated in the model.
Our  system  alternates  between  a  bottom-up  step 
which  generates  hypotheses  and  a  top-down  step 
which  applies  the  hypotheses  to  the  images.  Our 
analysis  thus  consists  of  many  small,  complete  inter¬ 
pretation  cycles  that  combine  bottom-up  processing 
with  feed-back  in  top-down  processing.  This  ap¬ 
proach  stands  in  contrast  to  traditional  image  seg¬ 
mentation  methods  which  do  not  relate  their  analysis 
to  intrinsic  models  and  that  also  generally  have  a 
strictly  bottom-up  control  structure.  We  feel  that  many 
low-level  image  understanding  methods  such  as 
shape-from-x  methods,  stereo  and  motion  analysis 
may  be  viewed  and  approached  under  this  paradigm. 
We  hope  to  extend  our  approach  into  a  more  com¬ 
plete  low-level  image  analysis  system  which  com¬ 
bines  color  analysis  with  a  geometrical  analysis  of  the 
scene,  exploiting  the  body  and  surface  reflection 
images.  Along  these  lines,  we  may  generate 
hypotheses  about  object  shapes  and  about  the  object 
materials29.  The  highlight  image  may  also  provide 
strong  evidence  for  the  position  of  the  light  source. 

Although  the  current  method  has  only  been  applied 
in  a  laboratory  setting,  its  success  shows  the  value  of 
modeling  the  physical  nature  of  the  visual  environ¬ 
ment.  Our  work  and  the  work  of  others  in  this  area 
may lead to methods that will free computer vision
from its current dependence on statistical signal-based
methods for image segmentation.
Reflection Vectors: Eight Plastic Objects under White Light

                         body reflection vector    surface reflection vector
dark red donut           (0.99,0.11,0.…)           (0.62,0.52,0.59)
yellow cup               (0.84,0.52,0.…)           (0.48,0.62,0.62)
green cup (right half)   (0.27,0.89,0.…)           (0.54,0.58,0.61)
red cup                  (0.95,0.26,0.…)           (0.68,0.48,0.56)
yellow donut             (0.77,0.61,0.…)           (0.69,0.51,0.51)
bright red donut         (0.98,0.18,0.…)           (0.77,0.61,0.18)
green donut              (0.21,0.78,0.…)           (0.47,0.63,0.31)
green cup (left half)    (0.27,0.86,0.…)           (0.61,0.52,0.59)
blue donut               (0.00,0.25,0.…)           (0.49,0.63,0.60)

illumination vector:
- computed by alg.       (0.57,0.58,0.57)
- independent meas.      (0.58,0.57,0.58)

Table 9-1: Body and surface reflection vectors of the
eight plastic objects under white light


Reflection Vectors: Plastic Cups under White Light

                         body reflection vector    surface reflection vector
green cup                (0.22,0.91,0.37)          (0.46,0.51,0.72)
yellow cup               (0.81,0.57,0.15)          (0.56,0.56,0.61)
orange cup               (0.95,0.26,0.15)          (0.47,0.58,0.66)

illumination vector:
- computed by alg.       (0.69,0.62,0.38)
- independent meas.      (0.58,0.57,0.58)

Table 9-2: Body and surface reflection vectors of the
three plastic cups under white light


Reflection Vectors: Plastic Cups under Yellow Light

                         body reflection vector    surface reflection vector
green cup                (0.49,0.86,0.16)          (0.76,0.62,0.20)
yellow cup               (0.85,0.52,0.08)          (0.74,0.65,0.14)
orange cup               (0.97,0.24,0.08)          (0.73,0.66,0.18)

illumination vector:
- computed by alg.       (0.81,0.56,0.19)
- independent meas.      (0.73,0.66,0.19)

Table 9-3: Body and surface reflection vectors of the
three plastic cups under yellow light
Abstract

In  this  paper,  we  develop  a  color  space  metric  which 
is  useful  for  computer  vision.  While  there  is  no  short¬ 
age  of  proposed  methods  for  quantifying  how  the  hu¬ 
man  eye  perceives  color  differences,  these  perceptual 
metrics  are  inherently  inappropriate  for  computer  vi¬ 
sion.  The  metric  we  develop  is  appropriate  for  images 
sensed  using  color  filters  (e.g.  R,  G,  B)  as  is  often  done 
in  computer  vision.  Our  approach  is  more  general  than 
previous  approaches  in  that  our  analysis  makes  use  of 
the  spectral  characteristics  of  the  filters  and  camera 
and  accounts  for  their  noise  properties.  Our  color  dis¬ 
tance  function  is  derived  as  an  estimate  of  the  physical 
distance  between  normalized  spectral  power  distribu¬ 
tions.  Components  of  this  distance  are  weighted  to 
account  for  sensor  noise  properties.  A  color  metric  has 
several  possible  uses  in  a  computer  vision  system.  A 
color  metric  is  useful  for  detecting  color  edges,  clas¬ 
sifying  intensity  edges,  and  estimating  color  variation 
within  image  regions.  We  demonstrate  the  usefulness 
of  our  color  metric  by  evaluating  its  performance  on 
regions  of  real  images. 

1  Introduction 

Until  recently,  most  work  in  computer  vision  has  dealt 
exclusively  with  images  obtained  using  a  single  sensor. 
It  is  perhaps  for  this  reason  that  many  of  the  available 
algorithms for color images are straightforward multidimensional
generalizations of algorithms originally
intended for intensity images. Many of these generalized
algorithms have been effectively applied to real
images, e.g. [12]. It is reasonable to expect, however,
that better uses for color information can be obtained
by first examining the physics that is specific to the
formation of color images. Only then can we objectively
say whether color images contain useful information
which is not present in intensity images and, if so,
how this information can be reliably extracted.


Perhaps  the  most  important  advantage  of  having 
color  information  in  addition  to  image  irradiance  in¬ 
formation  is  that  the  imaged  “color”  of  a  surface  of 
one  material  is  more  stable  under  changes  in  geometry 
than  the  corresponding  image  irradiance  values.  This 
assumes,  of  course,  that  some  mechanism  is  available 
for  factoring  the  intensity  component  out  of  a  color  sig¬ 
nal,  so  that  it  makes  sense  to  talk  about  independent 
intensity and color attributes of light. Shafer captured
the essence of this relationship in his dichromatic reflection
model [15], which decomposes both the specular
and body components of reflected light into an “intensity
factor” which depends on geometry but is independent
of wavelength and a “spectral factor” which
depends on wavelength but is independent of geometry.
Although physically there are small dependencies
of spectral reflectance on geometry [5], the dichromatic
reflection  model  is  a  very  useful  approximation  and  has 
led  to  some  impressive  results  in  recovering  intrinsic 
images  which  separate  highlights  from  body  reflection 
[6].  Thus,  given  that  the  “color”  of  an  imaged  surface 
is  more  stable  than  its  image  irradiance  values,  it  is 
worthwhile  in  computer  vision  to  study  the  properties 
of  color  in  detail. 

In this paper, we develop a color metric for computer
vision. This metric is intended to quantify the significance
of a color change in an image. Much work has
been done in an attempt to quantify perceptual color
differences [18]. Unfortunately, there does not exist an
obvious relationship between the way color is perceived
by a human and the way color is sensed by a computer
vision system. Some work in computer vision has been
done on developing color metrics for the purpose of detecting
color edges [10], [11], [1 111]. None of the proposed
metrics, however, consider the underlying properties of
the sensors being used. In this work, we develop a
more general color metric which accounts for the spectral
properties of the camera and filters and their noise
characteristics.

We do not believe that color edge detection is a
particularly useful step to be taken in computer vision.
The experiments of Novak and Shafer [11] and
Nevatia [10] have demonstrated that a large majority
of  detected  color  edges  are  also  detected  as  irradiance 
edges.  Interestingly,  human  subjects  have  been  unable 
to  bring  color  edges  into  focus  when  the  color  edge  does 
not  coincide  with  a  brightness  edge  [17]. 

We  intend  to  apply  our  color  metric  to  certain  regions 
of  an  image  following  the  detection  of  image  irradiance 
edges.  This  is  particularly  useful  for  classifying  irradi¬ 
ance  edges  during  physical  segmentation  as  discussed 
in [5]. For example, given an object of a single material
which is illuminated by a single spectral power
distribution, there will be an irradiance edge, but not a
“color” edge across a surface orientation discontinuity.
Similarly,  a  specular  irradiance  edge  in  an  image  will 
indicate  the  presence  of  an  inhomogeneous  material  if 
there  is  a  corresponding  “color”  edge  [5].  On  the  other 
hand,  a  specular  irradiance  edge  without  a  “color”  edge 
indicates  the  presence  of  a  homogeneous  material.  Our 
color  metric  is  also  useful  for  estimating  the  color  vari¬ 
ation  within  image  regions  of  continuous  image  irradi¬ 
ance.  This  variation  has  a  strong  relationship  to  the 
material  composition  of  an  object’s  surface.  We  hope 
to  be  able  to  use  this  measure  for  object  recognition. 

We  begin  in  section  2  with  a  description  of  color  sens¬ 
ing.  We  show  that  given  a  finite  number  of  color  filters 
it  is  possible  to  recover  a  finite  dimensional  approxi¬ 
mation  to  a  color  signal.  In  section  3,  we  introduce 
our  color  normalization  procedure  and  develop  a  met¬ 
ric  space  for  continuous  color  signals.  The  analysis  of 
section  3  is  specialized  to  a  finite  dimensional  represen¬ 
tation  in  section  4.  The  effects  of  noise  on  our  metric 
space  are  considered  in  section  5.  Section  6  presents 
experimental  results. 

2  Color  Sensing 

The image sensors considered in this work measure the
irradiance of incident light across a plane. The properties
of the light impinging on the sensor plane can
be characterized by the function $I(x, y, \lambda)$ where $I$ is
irradiance, $x$ and $y$ are spatial coordinates in the image
plane, and $\lambda$ is wavelength. Information about the
spectral properties of the incident light can be obtained
by using several sensors (e.g. color filters) with differing
wavelength responses. For the remainder of this
section, we consider a fixed image location $(x_0, y_0)$ and
abbreviate $I(x_0, y_0, \lambda)$ by $I(\lambda)$.

It  has  been  standard  in  computer  vision  to  digitize  an 
image  of  RGB  values  using  “red”,  “green”,  and  “blue" 
filters  and  to  refer  to  these  RGB  values  as  if  they  rep¬ 
resent  something  fundamental.  In  reality,  there  are  an 


infinite  number  of  combinations  of  filters  and  cameras 
that  might  be  used  to  obtain  RGB  values.  In  general, 
different  “red”  filters  placed  in  front  of  different  CCD 
cameras  will  give  us  different  R  values  for  the  same 
incident  light.  In  this  work,  we  consider  color  descrip¬ 
tions  that  are  related  to  the  physical  properties  of  the 
incident  light  and  that  are  therefore  somewhat  less  de¬ 
pendent  on  the  actual  sensors  used. 

We now describe a method for recovering an approximation
to the function $I(\lambda)$ at a point in an image. At
each image point, we measure the outputs $s_i$ of $n$ sensors.
Each sensor has a certain spectral response which
we denote by the function $f_i(\lambda)$. In a typical system,
each function $f_i(\lambda)$ will correspond to the product of a
color filter transmission function with the spectral response
of the camera. Therefore, at each image point
we have the measured values $s_i$ $(0 \le i \le n-1)$ given
by

$$s_i = \int f_i(\lambda)\, I(\lambda)\, d\lambda \qquad (2.1)$$


where $\lambda$ ranges over the entire electromagnetic spectrum.
In this work, we restrict our interest in the behavior
of $I(\lambda)$ to the visible region of the spectrum.
Therefore, the functions $f_i(\lambda)$ will be nonzero only for
values of $\lambda$ in the visible range. We denote the visible
range by $[\lambda_1, \lambda_2]$. Since many cameras respond to light
in the near infrared region of the spectrum, it may be
necessary to use an IR cutoff filter to block light for
which $\lambda > \lambda_2$.

Suppose we approximate the function $I(\lambda)$ by a linear
combination of $m$ basis functions $b_j(\lambda)$. If we let $a_j$
be the components of $I(\lambda)$ on this basis, then we have

$$I(\lambda) \approx \sum_{0 \le j \le m-1} a_j\, b_j(\lambda) \qquad (2.2)$$

Substituting, (2.1) becomes

$$s_i = \int_{\lambda_1}^{\lambda_2} f_i(\lambda) \Big( \sum_{0 \le j \le m-1} a_j\, b_j(\lambda) \Big)\, d\lambda \qquad (2.3)$$

which may be written

$$s_i = \sum_{0 \le j \le m-1} a_j \Big( \int_{\lambda_1}^{\lambda_2} f_i(\lambda)\, b_j(\lambda)\, d\lambda \Big) \qquad (2.4)$$

Let

$$k_{ij} = \int_{\lambda_1}^{\lambda_2} f_i(\lambda)\, b_j(\lambda)\, d\lambda \qquad (2.5)$$

denote the integral in (2.4). Then $k_{ij}$ is a constant
which depends only on the $i$th filter function and the
$j$th basis function. Let $S$ be the $n$-dimensional vector
defined by $S = [s_0, s_1, \ldots, s_{n-1}]^T$. Let $K$ be
the $n \times m$ constant matrix defined by $K(i,j) = k_{ij}$.
Let $A$ be the $m$-dimensional vector defined by
$A = [a_0, a_1, \ldots, a_{m-1}]^T$. Then we have the linear system of
$n$ equations

$$S = KA \qquad (2.6)$$

If we choose our filters $f_i(\lambda)$ and basis functions $b_j(\lambda)$
such that $K$ has maximal rank, then the $n$ sensor outputs
$s_0, s_1, \ldots, s_{n-1}$ uniquely determine $n$ components
$a_0, a_1, \ldots, a_{n-1}$ of $I(\lambda)$. Therefore, by letting $m = n$ in
(2.6) we can recover an approximation to the function
$I(\lambda)$ given by

$$I^*(\lambda) = B^T(\lambda)\, A \qquad (2.7a)$$

where

$$A = K^{-1} S \qquad (2.7b)$$

and $B(\lambda)$ is the vector $[b_0(\lambda), b_1(\lambda), \ldots, b_{n-1}(\lambda)]^T$.

In summary, $I^*(\lambda)$ is the unique approximation to
$I(\lambda)$ which lies in the space spanned by the $n$ basis
functions $b_0(\lambda), b_1(\lambda), \ldots, b_{n-1}(\lambda)$ and which also satisfies
the $n$ constraints of (2.1).

There are two sources of error in the approximation
$I^*(\lambda)$. The first source of error is noisy sensor
measurements $S$. The second source of error
is the possibility that $I(\lambda)$ cannot be exactly represented
in the space spanned by the basis functions
$b_0(\lambda), b_1(\lambda), \ldots, b_{n-1}(\lambda)$. The effects of sensor noise are
discussed in section 5. A discussion of approximation
error is not central to the development of this work,
but is included for completeness in an appendix.

The color recovery method described in this section
has frequently been confused with techniques that express
surface reflectance as a linear combination of basis
functions. We believe that expressing surface reflectance
using such a linear model was first done by
Sallstrom [14] and subsequently used by many others,
e.g. [2], [3], [7]. In this section, we are doing something
which is fundamentally different. We express the
function $I(\lambda)$ (not the surface reflectance) as a linear
combination of basis functions. For now, we are interested
in an approximation to $I(\lambda)$ and not in an
approximation to surface reflectance.
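
The recovery procedure of this section can be sketched numerically as follows. This is only an illustration under assumed conditions: the wavelength axis is rescaled to [-1, 1], the three Gaussian "filters" and the monomial basis are hypothetical choices that do not come from the paper, and the helper names `integrate` and `solve` are introduced here for the example.

```python
import math

def integrate(f, lo, hi, steps=2000):
    """Composite midpoint rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

def solve(K, s):
    """Solve K a = s by Gaussian elimination with partial pivoting."""
    n = len(s)
    M = [row[:] + [s[i]] for i, row in enumerate(K)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    a = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * a[k] for k in range(r + 1, n))
        a[r] = (M[r][n] - tail) / M[r][r]
    return a

# Hypothetical setup (not from the paper): three Gaussian sensor responses
# f_i and the basis b_j(x) = x**j on a wavelength axis rescaled to [-1, 1].
LO, HI = -1.0, 1.0
filters = [lambda x, c=c: math.exp(-((x - c) / 0.4) ** 2) for c in (-0.6, 0.0, 0.6)]
basis = [lambda x, j=j: x ** j for j in range(3)]

# k_ij = integral of f_i(x) * b_j(x), as in (2.5).
K = [[integrate(lambda x: f(x) * b(x), LO, HI) for b in basis] for f in filters]

# Simulated noiseless sensor outputs s_i, as in (2.1), for a signal that
# lies exactly in the span of the basis.
true_a = [1.0, 0.3, -0.2]
signal = lambda x: sum(a * b(x) for a, b in zip(true_a, basis))
S = [integrate(lambda x: f(x) * signal(x), LO, HI) for f in filters]

A = solve(K, S)  # A = K^{-1} S, as in (2.7b)
```

Because the simulated signal lies in the span of the basis and the measurements are noiseless, `A` agrees with `true_a` up to rounding; with a signal outside the span, `A` would only give the approximation discussed in the appendix.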


3  A  Metric  Space  for  Normalized 
Color 

In  this  section,  we  develop  a  normalization  procedure 
for  color.  This  normalization  is  motivated  by  the 
physics  of  color  image  formation  and  preserves  what 
we  feel  are  the  most  important  properties  of  color.  Af¬ 
ter  defining  the  normalization  procedure,  we  define  a 
distance  function  on  the  space  of  normalized  colors. 
We note that the analysis of this section assumes the
recovery of a continuous function $I(\lambda)$ using noiseless
sensors. In section 4 we specialize the analysis to finite
dimensional approximations like those recovered using
the technique of section 2. In section 5 we consider the
effects of sensor noise.

We begin by distinguishing the total power of the
function $I(\lambda)$ from its color. The total power of $I(\lambda)$ is
given by the $L^1[\lambda_1, \lambda_2]$ norm

$$\|I(\lambda)\|_1 = \int_{\lambda_1}^{\lambda_2} I(\lambda)\, d\lambda \qquad (3.1)$$

We define the normalized physical color of $I(\lambda)$ by the
function $\hat{I}(\lambda)$ having unit total power

$$\hat{I}(\lambda) = \frac{I(\lambda)}{\|I(\lambda)\|_1} \qquad (3.2)$$

The space of normalized physical colors is the space
of all continuous nonnegative functions $\hat{I}(\lambda)$ on $[\lambda_1, \lambda_2]$
having unit total power.

In [5] it is shown that to a very good approximation
spectral reflectance is independent of geometry. If we
assume this independence and a technique for removing
highlights such as [6], then we get the desirable property
that for a surface of a single material illuminated
by a single spectral power distribution the normalized
physical color of all image points corresponding to that
surface will be identical. In this sense, (3.2) serves to
factor out geometric information and preserve information
about source color and surface spectral reflectance.

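Given any two normalized physical colors $\hat{I}_1(\lambda)$ and
$\hat{I}_2(\lambda)$ we define the distance from $\hat{I}_1(\lambda)$ to $\hat{I}_2(\lambda)$ by the
$L^2[\lambda_1, \lambda_2]$ distance function

$$d(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) = \sqrt{\int_{\lambda_1}^{\lambda_2} \big[\hat{I}_1(\lambda) - \hat{I}_2(\lambda)\big]^2\, d\lambda} \qquad (3.3)$$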
For reasons relevant to our application, we use the
$L^2[\lambda_1, \lambda_2]$ distance of (3.3) rather than the commonly
used maximum distance defined by

$$d_{max}(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) = \max_{\lambda_1 \le \lambda \le \lambda_2} |\hat{I}_1(\lambda) - \hat{I}_2(\lambda)| \qquad (3.4)$$

While inaccuracies over a small range of $\lambda$ can cause
large deviations of $d_{max}$, the distance $d$ of (3.3) depends
on the distance between the functions integrated over
the entire spectrum. Therefore, the $L^2[\lambda_1, \lambda_2]$ distance
will usually give a more reliable characterization of the
distance between two measured functions than the distance
$d_{max}$.

Although we have not motivated our decision to
compute distances between functions normalized in
$L^1[\lambda_1, \lambda_2]$ using the $L^2[\lambda_1, \lambda_2]$ metric, functions treated
in this way have several useful properties. These properties
are discussed in [5].
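
The normalization of (3.2) and the two distances (3.3) and (3.4) can be illustrated with a small numerical sketch. Everything specific here is an assumption made for the example: the visible range of 400 to 700 nm, the smooth test spectrum, and the narrow-band "spike" standing in for a local measurement inaccuracy.

```python
import math

def integrate(f, lo, hi, steps=6000):
    """Composite midpoint rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

def normalize(f, lo, hi):
    """Scale a nonnegative f to unit total power, as in (3.1) and (3.2)."""
    power = integrate(f, lo, hi)
    return lambda lam: f(lam) / power

def l2_distance(f, g, lo, hi):
    """L2 distance of (3.3), evaluated numerically."""
    return math.sqrt(integrate(lambda lam: (f(lam) - g(lam)) ** 2, lo, hi))

def max_distance(f, g, lo, hi, steps=6000):
    """Maximum distance of (3.4), evaluated on a uniform sample grid."""
    h = (hi - lo) / steps
    return max(abs(f(lo + k * h) - g(lo + k * h)) for k in range(steps + 1))

LO, HI = 400.0, 700.0  # assumed visible range in nanometres

def signal(lam):
    """Hypothetical smooth spectral power distribution."""
    return 1.0 + 0.5 * math.sin(math.pi * (lam - LO) / (HI - LO))

def dimmer(lam):
    """Same spectrum with a different intensity factor (a change in geometry)."""
    return 0.25 * signal(lam)

def with_spike(width):
    """The signal plus a narrow-band error of the given width around 550 nm."""
    return lambda lam: signal(lam) + (2.0 if abs(lam - 550.0) <= width / 2 else 0.0)

n_signal = normalize(signal, LO, HI)

# Normalization factors out the intensity: the distance is (numerically) zero.
d_same = l2_distance(n_signal, normalize(dimmer, LO, HI), LO, HI)

wide = normalize(with_spike(2.0), LO, HI)
narrow = normalize(with_spike(0.5), LO, HI)
d_l2_wide = l2_distance(n_signal, wide, LO, HI)
d_l2_narrow = l2_distance(n_signal, narrow, LO, HI)
d_max_wide = max_distance(n_signal, wide, LO, HI)
d_max_narrow = max_distance(n_signal, narrow, LO, HI)
```

In this sketch, narrowing the spike leaves the maximum distance almost unchanged while the L2 distance shrinks, which is the behavior the text gives as the reason for preferring (3.3).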

4 A Finite Dimensional Representation

The functions $I(\lambda)$ in the discussion of section 3 were
assumed to be arbitrary nonnegative continuous functions
on $[\lambda_1, \lambda_2]$, i.e. nonnegative elements of the space
$C[\lambda_1, \lambda_2]$. Using the color recovery method of section
2, however, we are only able to recover finite dimensional
approximations to these functions. In this section,
we specialize the analysis of section 3 to finite
dimensional approximations which can be recovered using
the method of section 2. Figure 1 depicts the different
levels of color representation in our system.

In general, we begin by restricting ourselves to some
$n$-dimensional subset $S$ of $C[\lambda_1, \lambda_2]$. If $S$ is specified by
a set of basis functions, then the Gram-Schmidt process
[4] can be used to construct an orthonormal basis for
$S$. This orthonormal basis may then be taken to be the
basis $B(\lambda)$ of section 2.

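For the purpose of concreteness, we will take $S$ to be
the set of polynomials of degree $n-1$ and the orthonormal
basis to be the first $n$ normalized Legendre polynomials.
The normalized Legendre polynomials are given
by

$$p_i(\lambda) = \sqrt{\frac{2i+1}{2}}\, P_i(\lambda) \qquad (4.1)$$

where $P_i(\lambda)$ is the Legendre polynomial of degree $i$.
The functions $p_i(\lambda)$ are orthonormal in the sense that,
with the visible range $[\lambda_1, \lambda_2]$ rescaled to $[-1, 1]$,

$$\int_{-1}^{1} p_i(\lambda)\, p_j(\lambda)\, d\lambda = \begin{cases} 1, & \text{if } i = j; \\ 0, & \text{if } i \ne j. \end{cases} \qquad (4.2)$$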


We  note  that  the  analysis  of  this  section  applies  more 
generally  to  any  orthonormal  basis  for  which  one  of  the 
basis  functions  is  a  constant. 


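4.1 Computing Normalized Physical Color $\hat{I}(\lambda)$

Using the normalized Legendre polynomial representation,
the technique of section 2 recovers an approximation
to $I(\lambda)$ of the form

$$I(\lambda) = \sum_{0 \le i \le n-1} a_i\, p_i(\lambda) \qquad (4.3)$$

The total power of $I(\lambda)$ is given by

$$\|I(\lambda)\|_1 = \int_{\lambda_1}^{\lambda_2} I(\lambda)\, d\lambda = \int_{\lambda_1}^{\lambda_2} \Big[ \sum_{0 \le i \le n-1} a_i\, p_i(\lambda) \Big]\, d\lambda = \sqrt{2}\, a_0 \qquad (4.4)$$

where the last step follows from the orthogonality of the
functions $p_i(\lambda)$, since the constant function 1 equals $\sqrt{2}\, p_0(\lambda)$.
Therefore, the total power of $I(\lambda)$ may
be determined by only considering the first coefficient
in the normalized Legendre polynomial representation.
From the total power of $I(\lambda)$, we can determine the
normalized physical color $\hat{I}(\lambda)$ by

$$\hat{I}(\lambda) = \frac{I(\lambda)}{\sqrt{2}\, a_0} = \sum_{0 \le i \le n-1} \hat{a}_i\, p_i(\lambda) \qquad (4.5)$$

where

$$\hat{a}_i = \frac{a_i}{\sqrt{2}\, a_0} \qquad (4.6)$$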


4.2  Metric  Space  Properties 

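For two functions $I_1(\lambda)$ and $I_2(\lambda)$ given by

$$I_1(\lambda) = \sum_{0 \le i \le n-1} r_i\, p_i(\lambda), \qquad I_2(\lambda) = \sum_{0 \le i \le n-1} s_i\, p_i(\lambda) \qquad (4.7)$$

the normalized physical colors are

$$\hat{I}_1(\lambda) = \sum_{0 \le i \le n-1} \hat{r}_i\, p_i(\lambda), \qquad \hat{I}_2(\lambda) = \sum_{0 \le i \le n-1} \hat{s}_i\, p_i(\lambda) \qquad (4.8)$$

where the $\hat{r}_i$ and $\hat{s}_i$ are computed as in (4.6).

The color space distance between $\hat{I}_1(\lambda)$ and $\hat{I}_2(\lambda)$ is
given by

$$d(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) = \sqrt{\int_{\lambda_1}^{\lambda_2} \Big[ \sum_{0 \le i \le n-1} (\hat{r}_i - \hat{s}_i)\, p_i(\lambda) \Big]^2\, d\lambda} \qquad (4.9)$$

which simplifies because of orthogonality to

$$d(\hat{I}_1(\lambda), \hat{I}_2(\lambda)) = \sqrt{\sum_{0 \le i \le n-1} (\hat{r}_i - \hat{s}_i)^2} \qquad (4.10)$$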
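
As a check on (4.4) through (4.10), the following sketch computes the distance once from the normalized coefficients and once by numerical integration. It assumes the wavelength axis has been rescaled to [-1, 1]; the coefficient vectors `r` and `s` are arbitrary illustrative values, and the helper names are introduced only for this example.

```python
import math

def legendre(i, x):
    """Legendre polynomial P_i(x) via the three-term recurrence."""
    p_prev, p = 1.0, x
    if i == 0:
        return p_prev
    for k in range(1, i):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def p_norm(i, x):
    """Normalized Legendre polynomial of (4.1)."""
    return math.sqrt((2 * i + 1) / 2.0) * legendre(i, x)

def normalize_coeffs(a):
    """Coefficients of the normalized physical color, as in (4.6)."""
    power = math.sqrt(2.0) * a[0]  # total power, from (4.4)
    return [ai / power for ai in a]

def color_distance(r, s):
    """Distance (4.10) between two colors given their raw coefficients."""
    r_hat, s_hat = normalize_coeffs(r), normalize_coeffs(s)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(r_hat, s_hat)))

def evaluate(coeffs, x):
    """Evaluate the expansion sum_i coeffs[i] * p_i(x)."""
    return sum(c * p_norm(i, x) for i, c in enumerate(coeffs))

def integrate(f, lo, hi, steps=2000):
    """Composite midpoint rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

# Illustrative raw coefficients for two recovered signals (assumed values).
r = [2.0, 0.4, -0.1]
s = [1.0, -0.1, 0.2]
r_hat, s_hat = normalize_coeffs(r), normalize_coeffs(s)

d_coeff = color_distance(r, s)  # via (4.10)
d_integral = math.sqrt(         # via (4.9)
    integrate(lambda x: (evaluate(r_hat, x) - evaluate(s_hat, x)) ** 2, -1.0, 1.0)
)
unit_power = integrate(lambda x: evaluate(r_hat, x), -1.0, 1.0)
```

The two distances agree up to the integration error, and the first normalized coefficient is the same for both colors, so only the remaining coefficients contribute to the distance.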


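4.3 Color Space as a Subset of $\Re^{n-1}$

From (4.5), our representation for normalized physical
colors is

$$\hat{I}(\lambda) = \sum_{0 \le i \le n-1} \hat{a}_i\, p_i(\lambda) \qquad (4.11)$$

From (4.6) we see that all normalized physical colors
have $\hat{a}_0 = 1/\sqrt{2}$. We can write

$$\hat{I}(\lambda) = \frac{1}{2} + \sum_{1 \le i \le n-1} \hat{a}_i\, p_i(\lambda) \qquad (4.12)$$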


We can consider the normalized physical color $\hat{I}(\lambda)$ to
be the point in $\Re^{n-1}$ with coordinates $\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_{n-1}$.
Since we require $\hat{I}(\lambda) \ge 0$ on $\lambda_1 \le \lambda \le \lambda_2$, normalized
physical colors form a proper subset of $\Re^{n-1}$. To be
precise, normalized physical colors are those points of
$\Re^{n-1}$ for which

$$\sum_{1 \le i \le n-1} \hat{a}_i\, p_i(\lambda) \ge -\frac{1}{2}, \qquad -1 \le \lambda \le 1 \qquad (4.13)$$

Let $C^{n-1}$ be the set of points in $\Re^{n-1}$ which are normalized
physical colors. Points in $C^{n-1}$ will be referred
to by vectors of the form

$$A = (\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_{n-1}) \qquad (4.14)$$

where the coordinates $\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_{n-1}$ have the same
significance as in (4.12).

An important consequence of viewing colors as points
in $C^{n-1}$ is that the usual euclidean metric in $\Re^{n-1}$
is equivalent to the metric we defined for normalized
physical colors in section 3. This is seen by examining
(4.10) and recalling that $\hat{r}_0 = \hat{s}_0$. Thus, our representation
in terms of normalized Legendre polynomials
allows intuition about the familiar distance in $\Re^{n-1}$ to
be applied to distance in color space.

5 Modeling Sensor Noise

If we are able to make perfect measurements
$s_0, s_1, \ldots, s_{n-1}$, then (2.7) allows us to compute the
approximation to $I(\lambda)$ given by

$$I^*(\lambda) = B^T(\lambda)\, A \qquad (5.1)$$

In real physical situations, each of the measurements
$s_0, s_1, \ldots, s_{n-1}$ will have some amount of variability.
In this section, we examine how the variance in the
measurements $S$ affects the computed approximation
of (5.1).

Let $\bar{A}$ denote $E(A)$ and $\bar{S}$ denote $E(S)$. From (2.7b),
we have

$$\bar{A} = E[K^{-1} S] = K^{-1} \bar{S} \qquad (5.2)$$

Let $\Sigma_A$ be the $n \times n$ covariance matrix of $A$ defined by

$$\Sigma_A = E[(A - \bar{A})(A - \bar{A})^T] \qquad (5.3)$$
$$= E[(K^{-1} S - K^{-1} \bar{S})(K^{-1} S - K^{-1} \bar{S})^T] \qquad (5.4)$$
$$= K^{-1}\, E[(S - \bar{S})(S - \bar{S})^T]\, (K^{-1})^T \qquad (5.5)$$

which may be written

$$\Sigma_A = K^{-1}\, \Sigma_S\, (K^{-1})^T \qquad (5.6)$$

where $\Sigma_S$ is the $n \times n$ covariance matrix of $S$.

For most real sensors, it is reasonable to assume that
$A$ has the multivariate normal density given by

$$p(A) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_A|^{1/2}} \exp\Big[ -\tfrac{1}{2} (A - \bar{A})^T\, \Sigma_A^{-1}\, (A - \bar{A}) \Big] \qquad (5.7)$$

Contours of constant density are hyperellipsoids satisfying

$$(A - \bar{A})^T\, \Sigma_A^{-1}\, (A - \bar{A}) = C \qquad (5.8)$$

where $C$ is a constant. Thus, from $\Sigma_A$ we can determine
the scatter of the data in any direction. It follows that,
in general, points in $C^{n-1}$ will tend to exhibit greater
dispersion in certain directions than in others. Thus
the euclidean metric for $C^{n-1}$ which is appropriate for
noiseless measurements (see section 4) will be inappropriate
for real sensor measurements. This is because
directions in $C^{n-1}$ along which sensor noise is amplified
can dominate the euclidean distance of (4.10). It is
desirable to develop a color metric which can take into
account this anisotropic property of $C^{n-1}$ and thereby
produce a distance between two normalized physical
colors which is relatively independent of scatter due to
noise. We propose as a metric the Mahalanobis distance
[16] given by

$$d(A_1, A_2) = \sqrt{(A_1 - A_2)^T\, \Sigma_A^{-1}\, (A_1 - A_2)} \qquad (5.9)$$

where $A_1$ and $A_2$ are points in $C^{n-1}$. The metric of
(5.9) has the effect of normalizing by the variance in
each direction, thus giving a more representative estimate
of the physical distance between $A_1$ and $A_2$ than
the euclidean distance. We observe that if $\Sigma_A$ is unitary,
then $C^{n-1}$ is isotropic. As expected, for this case
(5.9) simplifies to the scaled euclidean distance.
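
The noise propagation of (5.6) and the distance of (5.9) can be sketched as follows. The 2 x 2 matrices are hypothetical values chosen only to make the anisotropy visible, and the sketch treats the vectors generically, glossing over the color normalization step.

```python
import math

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]

def inverse(X):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(X)
    M = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(X)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [v - factor * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def mahalanobis(a1, a2, cov):
    """Mahalanobis distance of (5.9) for covariance matrix cov."""
    d = [x - y for x, y in zip(a1, a2)]
    cov_inv = inverse(cov)
    n = len(d)
    return math.sqrt(sum(d[i] * cov_inv[i][j] * d[j]
                         for i in range(n) for j in range(n)))

# Hypothetical inverse sensor matrix and isotropic sensor noise (assumed values).
K_inv = [[1.0, 0.5], [-0.5, 2.0]]
cov_S = [[0.01, 0.0], [0.0, 0.01]]

# Sigma_A = K^{-1} Sigma_S (K^{-1})^T, as in (5.6).
cov_A = matmul(matmul(K_inv, cov_S), transpose(K_inv))

# Two displacements of equal euclidean length: one along the axis where the
# recovered coefficients are noisier, one along the quieter axis.
d_noisy = mahalanobis([0.0, 0.0], [0.0, 0.1], cov_A)
d_quiet = mahalanobis([0.0, 0.0], [0.1, 0.0], cov_A)

# Isotropic covariance: the distance reduces to a scaled euclidean distance.
d_iso = mahalanobis([0.3, 0.0], [0.0, 0.4], [[0.04, 0.0], [0.0, 0.04]])
```

Even with isotropic sensor noise, the recovered coefficients are noisier along some directions than others, so the same euclidean displacement counts for less along a noisy direction than along a quiet one.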

6  Experiments 

In  this  section  we  show  the  results  of  some  preliminary 
experiments  using  the  metric  of  (5.9).  For  these  ex¬ 
periments  we  digitized  a  color  image  of  the  valve  part 
from  the  SUCCESSOR  vision  system  project  [1].  Four 
different  sensors  were  used  to  obtain  the  color  informa¬ 
tion.  The  spectral  response  curves  for  the  four  sensors 
are  shown  in  figure  2.  The  large  amplitude  curve  indi¬ 
cates  the  spectral  response  of  the  camera  and  the  three 
other  curves  indicate  the  spectral  response  of  the  cam¬ 
era  multiplied  by  the  response  of  the  color  filters  used. 
These  curves  were  taken  from  manufacturer’s  specifi¬ 
cations.  In  the  future,  we  hope  to  be  able  to  measure 
these  curves  directly.  The  basis  functions  used  for  the 
experiments  are  the  normalized  Legendre  polynomials 
of section 4. The resulting matrix $K^{-1}$ is

-0.279   0.463   0.648
-0.609   0.964   0.996   0.694
-0.507   0.751   0.578   1.082
-0.446   0.531   0.892   0.568

Here we will show the use of our color metric in detecting
material edges. Figure 3 shows a scatterplot
of points in a 2-D cross section of $C^3$ across an edge
corresponding to two different metals. P1 indicates the
strength of the component corresponding to $p_1(\lambda)$ and
P2 indicates the strength of the component corresponding
to $p_2(\lambda)$ using the color recovery method of section
2. The points lie approximately in two clusters corresponding
to the two metals, with some blurring across
the edge. Figure 4 shows the result of applying our
color metric point by point to a line of pixels across
the material edge. The large central maximum indicates
the presence of a color edge which in this case
corresponds to a material edge. Figures 5 and 6 show
similar results for another material edge in the same
image.

Since intensity information has been factored out,
the large central maxima in figures 4 and 6 indicate
color discontinuities. We note that it is not necessary
to localize edges precisely using color since many techniques
are available for localizing the corresponding intensity
edges, e.g. [9]. In the future, we hope to apply
our metric to estimating color variation within image
regions of continuous irradiance.

Appendix: Sensor Selection and Approximation Error

Given the recovery technique described in section 2,
we examine how the choice of the sensors $f_i(\lambda)$ can affect
the quality of the recovered approximation to $I(\lambda)$.

In particular, we show that given an orthonormal system
of $n$ basis functions, it is possible to choose $n$ sensors
such that the recovered approximation $I^*(\lambda)$ is the
best least squares approximation to $I(\lambda)$ in the space
spanned by the basis.

More formally, let $\phi_0(\lambda), \phi_1(\lambda), \ldots, \phi_{n-1}(\lambda)$ be an orthonormal
set of basis functions in $L^2[\lambda_1, \lambda_2]$. Then
given $n$ sensor measurements $S$, the procedure of section
2 allows us to recover an approximation $I^*(\lambda)$ of
the form

$$I^*(\lambda) = \sum_{0 \le i \le n-1} a_i\, \phi_i(\lambda) \qquad (A.1)$$

The best approximation $I^*(\lambda)$ in the least squares sense
is the choice of the $a_i$ which minimizes

$$M = \int_{\lambda_1}^{\lambda_2} \Big[ I(\lambda) - \sum_{0 \le i \le n-1} a_i\, \phi_i(\lambda) \Big]^2\, d\lambda \qquad (A.2)$$

We observe that (2.1) may be written

$$s_i = \langle f_i, I \rangle \qquad (A.3)$$

where $\langle \cdot, \cdot \rangle$ denotes the $L^2[\lambda_1, \lambda_2]$ inner product. Thus
if we define

$$f_i(\lambda) = \phi_i(\lambda), \qquad i = 0, 1, \ldots, n-1 \qquad (A.4)$$

then we will have

$$s_i = \langle \phi_i, I \rangle \qquad (A.5)$$

Since the $\phi_i(\lambda)$ are orthonormal functions, $K$ will be
the identity matrix. From (2.7), the technique of section
2 will recover the approximation $I^*(\lambda)$ given by

$$I^*(\lambda) = \sum_{0 \le i \le n-1} \langle \phi_i, I \rangle\, \phi_i(\lambda) \qquad (A.6)$$

It is easy to show [4] that the approximation $I^*(\lambda)$ given
by (A.6) minimizes $M$ and that the error is given by

$$\|I(\lambda) - I^*(\lambda)\|^2 = \|I(\lambda)\|^2 - \|I^*(\lambda)\|^2 \qquad (A.7)$$

Therefore, it is possible to choose sensors $f_i(\lambda)$ such
that the procedure of section 2 computes the best least
squares approximation to an arbitrary function $I(\lambda)$.

It is natural to ask how large $n$ should be in order
that the function $I^*(\lambda)$ is a good approximation to $I(\lambda)$.
This depends, of course, on the properties of the function
$I(\lambda)$ and how well these properties are captured
by the basis functions $B(\lambda)$. Maloney [8] has done a
thorough analysis of reflectance functions $R(\lambda)$ for real
materials by considering both empirical data and the
fundamental physical processes which determine the
properties of $R(\lambda)$. He has concluded that at least
five basis functions are required to accurately characterize
$R(\lambda)$. Unfortunately, no corresponding analysis
has been done for illuminants.
ABSTRACT

Low level modern computer vision is not domain dependent,
but concentrates on problems that correspond to
identifiable modules in the human visual system. Several
theories have been proposed in the literature for the computation
of shape from shading, shape from texture, retinal
motion from spatiotemporal derivatives of the image intensity
function and the like.

The problems with some of the existing approaches are
basically the following:

(1) The employed assumptions are usually very strong (they
do not hold in a large subset of real images), and so
some of the algorithms fail when applied to real images.

(2) Usually the constraints from the geometry and the physics
of the problem are not enough to guarantee uniqueness
of the computed parameters. In this case, strong
additional assumptions about the world are used, in
order to restrict the space of all solutions to a unique
value.

(3) Even if no assumptions at all are used and the physical
constraints are enough to guarantee uniqueness of the
computed parameters, then in most cases the resulting
algorithms are not robust, in the sense that if there is a
slight error in the input (i.e. a small amount of noise in
the image), this results in a catastrophic error in the output
(computed parameters), and this is observed from
experiments.

It turns out that if several available cues are combined,
then the above mentioned problems disappear in most cases;
the resulting algorithms compute robustly and uniquely the
intrinsic parameters (shape, depth, motion, etc.).

In this paper the problem of machine vision is explored
from its basics. A low level mathematical theory is presented
for the unique and robust computation of intrinsic parameters.
The computational aspect of the theory envisages a
cooperative highly parallel implementation, bringing in information
from five different sources (shading, texture, motion,
contour and stereo), to resolve ambiguities and ensure uniqueness
of the intrinsic parameters.

under Contract DAAB07-86-K-F073 is gratefully acknowledged, as is
the help of Sandy German and Barbara Bull in preparing this paper.

1. INTRODUCTION

A large part of Low and Intermediate Level Vision, i.e.
problems such as shape from texture, shape from shading,
structure from motion, three-dimensional motion analysis and
shape from contour, depth computation and determination of
light  source  direction,  are  studied.  All  these  problems  have 
been  studied  in  the  past,  using  very  few  information  sources. 
It  was  proved  that  in  order  to  achieve  uniqueness  of  the 
underlying  computations,  some  assumptions  should  be 
employed.  This  research  discovered  for  the  first  time  con¬ 
straints  relating  three-dimensional  information  with  two- 
dimensional  image  properties  and  resulted  in  algorithms  that 
worked  in  synthetic  laboratory  images  and  in  some  restricted 
domain  of  natural  images.  In  this  work,  we  study  the  above 
problems,  by  using  as  much  information  as  is  available  from 
different  cues  (stereo,  motion,  contour,  shading  and  texture). 
This  results  in  algorithms  that  provably  compute  uniquely 
what  they  are  supposed  to  compute. 

The  basic  ideas  in  this  paper  are  centered  around  the 
fact  that  minimal  assumptions,  uniqueness  and  stability, 
should  (and  can)  be  the  first  and  basic  requirements  for  a 
visual  computation. 

The  next  section  introduces  the  reader  to  basic  problems 
in  computer  vision  and  establishes  some  of  the  required 
nomenclature,  and  Section  3  reviews  a  large  part  of  previous 
work,  gives  the  motivation  for  the  research  needed  and 
describes the results we have obtained.

2.  PREREQUISITES 

2.1.  The  Goal  of  Machine  Vision 

It  is  very  difficult  to  define  the  central  problem  of  com¬ 
puter  vision  or  vision  in  general,  as  in  many  other  scientific 
fields.  What  goes  on  inside  our  heads  when  we  see?  Most 
people  take  seeing  so  much  for  granted  that,  few  will  ever 
have  considered  the  question  seriously.  Here  we  attempt  to 
give  the  following  loose  definition  of  the  central  problem  of 
computer  vision: 

“The central problem of computer vision is: from one or
a sequence of images of a moving or stationary object or a
scene, taken by a monocular (one eye) or polynocular (many
eyes) moving or stationary observer, to understand the object
or the scene and its three-dimensional properties.”




The  reader  will  immediately  observe  that  all  the  terms  in 
the  above  definition  are  well  defined,  with  the  exception  of 
the term “understand”. What is really the meaning of
“understand”  with  respect  to  this  problem?  The  problem  of 
finding  meaning  is  the  central  one  in  artificial  intelligence  and 
it  is  by  no  means  answered.  For  this  reason,  because  various 
researchers  understand  meaning  in  different  ways,  there  have 
basically  been  two  schools  of  thought  in  computer  vision. 
Although  no  clear  distinction  between  them  can  be  made,  we 
can  safely  differentiate  them  into  two  schools:  Reconstruction 
and Recognition. The reconstruction school worries about the
reconstruction of the physical parameters of the visual world,
such as the depth or orientation of surfaces, the boundaries of
objects,  the  direction  of  light  sources  and  the  like.  The  recog¬ 
nition  school  worries  about  the  recognition  or  description  of 
objects  that  we  see  and  involves  processes  whose  end  product 
is  some  piece  of  behavior  like  a  decision  or  a  motion.  Both 
schools  have  strong  ties  with  psychology  and  neuroscience  and 
it  is  strongly  believed  at  this  point  that  both  schools  will 
merge  into  a  new  one  that  will,  it  is  hoped,  find  an  answer  to 
the  difficult  questions  of  the  vision  problem.  Figure  2.1  shows 
a  schematic  description  of  the  above  mentioned  schools  of 
thought. 

2.2.  The  Machine  Vision  Goal  Revisited 

Up  to  this  point  we  have  been  rather  general  since  we 
have  been  talking  about  computer  vision  as  having  as  its  goal 
the development of a universal visual system. We consider
the explanation of Horn [1986] very much adequate and we
adopt  it  here.  To  be  more  specific,  a  machine  vision  system 
analyzes  images  and  produces  descriptions  of  what  is  imaged. 
These  descriptions  must  capture  the  aspects  of  the  objects 
being  imaged  that  are  useful  in  carrying  out  some  task.  So, 
we  consider  the  machine  vision  system  as  part  of  a  larger 
entity  that  interacts  with  the  environment.  The  vision  sys¬ 
tem  can  be  considered  an  element  of  a  feedback  loop  that  is 
concerned  with  sensing,  while  other  elements  are  dedicated  to 
decision  making  and  the  implementation  of  these  decisions. 
The  input  to  the  machine  vision  system  is  an  image,  or 
several  images,  while  its  output  is  a  description  that  should 
satisfy  at  least  the  following  two  criteria  (following  Horn 
[1986]): 

a) It must bear a relevant relationship to what is being
imaged;

b)  It  must  contain  all  the  information  needed  for  the  specific 
task. 

Obviously,  the  first  criterion  ensures  that  the  description 
depends  in  some  way  on  the  visual  input.  The  second  ensures 
that  the  information  provided  is  useful.  Something  has  to  be 
said  about  the  concept  of  description  that  we  used  above.  An 
object  does  not  have  a  unique  description.  We  can  think  of 
descriptions  at  many  levels  of  detail  and  from  many  points  of 
view.  Fortunately,  we  can  avoid  this  potential  philosophical 
snare  by  considering  the  task  for  which  the  description  is 
intended.  That  is,  we  do  not  want  just  any  description  of 
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what  is  imaged,  but  one  that  allows  us  to  take  appropriate 
action. 

An  example,  borrowed  from  Horn  [1986],  may  help  to 
clarify these ideas. Consider the task of picking up parts
from a conveyor belt. The parts may be randomly oriented
and positioned on the belt. There may be several different
types of parts, with each to be loaded into a different fixture.
The  vision  system  is  provided  with  images  of  the  objects  as 
they  are  transported  past  a  camera  mounted  above  the  belt. 
The  descriptions  that  the  system  has  to  produce  in  this  case 
are  simple.  It  need  only  give  the  position,  orientation  and 
type  of  each  object.  This  description  may  be  just  a  few 
numbers.  In  other  situations  an  elaborate  symbolic  descrip¬ 
tion  may  be  needed.  Figure  2.2  depicts  a  vision  system. 

2.3.  Current  Research  Status 

For  a  clear  exposition  on  how  the  field  shaped  up  in  its 
current form, see [Horn, 1986]. Modern computer vision
concentrates on topics that correspond to
identifiable modules in the human visual system. And
although we don’t know what exactly these modules are, we
understand that there should exist modules that compute 3-D
parameters from specific cues, such as shading, motion, stereo,
contours  and  texture.  When  we  say  3-D  parameters,  we  mean 
intrinsic  images,  such  as  shape,  depth,  reflectance,  three- 
dimensional  motion,  illuminant  direction  and  the  like.  So, 
one  could  say  that  a  large  part  of  today’s  research  is: 

Compute  Y  from  X. 

where  Y  is  an  intrinsic  property  (shape,  depth,  retinal  and 
three-dimensional  motion,  etc.)  and  X  is  a  cue  in  the  image  or 
a property of the observer (shading, texture, stereopsis, etc.).

The  following  figure  broadly  summarizes  the  status  of 
contemporary  reconstructionist  computer  vision.  On  the 
right,  we  see  the  various  cues,  and  on  the  left  the  intrinsic 
parameters. Research tries to recover from any of the cues on
the right some of the intrinsic properties on the left. An
arrow from box 1 to box 2 indicates that the property in box
2  is  recovered  from  the  cue  in  box  1.  The  names  along  the 
arrows  represent  some  of  the  researchers  who  have  worked  on 
this  specific  recovery.  More  complete  references  can  be  found 
in  the  rest  of  the  paper.  At  this  point  we  have  to  make  clear 
that  the  intrinsic  parameters  about  which  we  are  writing  can 
basically be classified in two categories: retinotopic and non-
retinotopic. The non-retinotopic ones can be divided into
features (physical parameters) and objects and relations [Ballard,
1984]. The retinotopic ones (shape, depth and the like)
are the ones of most interest in this paper. These parameters
are spatially indexed at every image point.¹ One might say
that the retinotopic parameters are the basic subject of the
Reconstruction School, and the non-retinotopic ones (features)
of the Recognition School. In this paper we will mostly be
talking about Low-Level Vision, and so the analysis of three-
dimensional shape models and transformations, as part of
High-Level Vision modules, won’t be treated. Finally, it has
to  be  said  that  the  current  status,  Figure  2.3,  is  by  no  means 
complete.  Other  sources  of  information  such  as  color  and 
nonplanar  contours  are  of  great  importance,  but  we  will  not 
discuss  them  here. 

2.4.  A  Word  of  Caution  and  What  is  to  Come 

In  the  preceding  sections,  we  have  emphasized  that  con¬ 
temporary  computer  vision  is  worrying  about  the  recovery  of 
three-dimensional  (world)  properties  from  two-dimensional 
image  properties.  By  no  means  do  we  imply  that  this  is  the 
only  issue  of  today’s  research.  There  is  a  lot  of  excellent 
research  on  low  and  high  level  vision,  object  recognition  and 
navigation.  We  feel  that  the  bulk  of  research  (from  2-D  pro¬ 
perties  to  3-D  properties)  is  important  because  a  clear  under¬ 
standing  of  these  issues  will  contribute  a  great  deal  to  our 
knowledge  of  extrapersonal  space  perception,  to  our  under¬ 
standing  of  the  cortex  and  to  our  ability  to  construct 
machines  with  visual  sense.  Simple  thinking  may  convince  us 
that  if  we  ever  hope  to  understand  how  the  visual  system 

¹But we treat the recovery of 3-D motion and light source
direction, which are not retinotopic but global parameters.


Figure 2.3: Current research status.


works,  we  must  first  understand  that  our  only  input  is  two- 
dimensional  images,  and  so  in  order  to  reason  about  the 
three-dimensional  world,  we  must  discover  constraints 
between  the  images  and  the  three-dimensional  world  that  is 
imaged.  On  the  other  hand,  prior  knowledge  about  the  world 
can be of great help. We are not opposed to using a priori
knowledge about the world in order to help the process of
understanding the 3-D space from its images. But before we
do that, we should first analyze the various vision problems
with as few assumptions as possible, and if no solution is
possible, then we should resort to additional assumptions, exactly
as previous research (Figure 2.3) discovered constraints from
single cues.

3.  UNIQUE  INTRINSIC  IMAGES:  THE  PROBLEM, 
THE  ANSWER  AND  TECHNICAL  PREREQUISITES 

Here  we  describe  what  the  problems  of  the  current 
research  status  are  and  we  propose  a  new  approach.  In  the 
course  of  our  analysis  we  give  the  technical  prerequisites  for 
the foundation of the technical work described later.

3.1.  The  Current  Research  Picture  Revisited 

Recalling the current research picture from Figure 2.3,
we see that the intrinsic parameters that will be described
extensively in the rest of this section are computed from some
particular image cue. Indeed, shading, texture, contours,
motion  and  stereo  are  very  important  cues  for  obtaining 
three-dimensional  information,  and  later  sections  will  show 
that.  If  we  look  carefully  at  the  research  picture  from  Figure 
2.3,  we  will  realize  that  an  intrinsic  parameter  is  computed 
using  only  a  particular  cue.  So  we  have  algorithms  for  shape 
from  shading,  shape  from  motion,  depth  from  stereo,  and  the 
like.  There  are,  however,  three  basic  problems  with  this 
approach. 

The  first  problem  has  to  do  with  employing  the  right 
assumptions.  Some  of  these  algorithms  are  based  on 
assumptions  which  despite  their  generality  are  not  fre¬ 
quently  present  in  the  real  world  and  so  these  algorithms 
fail when applied to a variety of natural images. An
example of this is all the algorithms for the computation
of shape from texture [Witkin, 1981; Stevens, 1979;
Dunn et al., 1983]. In these algorithms the basic
assumption is directional isotropy. In other words, it is
assumed that contours and line segments in natural
images have orientations which are uniformly distributed
over all directions. Obviously, if we look around us for
natural or man-made surfaces, we won't find that this
assumption is always true. Other research [Aloimonos,
1986] assumed uniform density. Again, not all textures
have this property. Of course, questions such as what is
the  most  general  assumption  (i.e.  the  one  present  in  most 
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natural  textures)  are  legitimate,  but  their  answer  may 
come  from  an  empirical  analysis. 

The  second  problem  has  to  do  with  uniqueness  proper¬ 
ties  of  the  resulting  algorithms.  Some  of  the  problems  in 
Figure  2.3,  as  formulated,  cannot  have  a  unique  solution. 
So,  in  order  to  bring  down  the  space  of  all  solutions  to  a 
unique  point,  assumptions  are  made  about  the  world 
which  usually  are  not  general  enough  and  the  algorithms 
may  fail  when  applied  to  some  real  images.  An  example 
of  this  is  the  shape  from  shading  algorithms  [Horn,  1977; 
Ikeuchi  and  Horn,  1981;  Brooks,  1984]  that  use  assump¬ 
tions  about  the  smoothness  of  the  surfaces  in  view. 

The  third  problem  with  the  current  research  status  is 
the  one  which  has  to  do  with  the  robustness  or  stability 
of  the  resulting  algorithms.  Even  if  theoretical  analysis 
shows  that  given  the  constraints  at  hand  a  particular 
problem  has  a  unique  solution,  in  practice  it  turns  out 
that  the  solution  might  be  unstable.  In  other  words,  a 
very  small  error  in  the  input  might  result  in  a  catas¬ 
trophic  error  in  the  output.  An  example  of  this  is  some 
algorithms that compute 3-D motion from retinal
motion, using only one camera [Tsai and Huang, 1984;
Longuet-Higgins and Prazdny, 1981; Prazdny, 1980,
1981]. The basic problems with the current research
status  can  be  summarized  in  the  following  table. 

Table 1

Problems of Current Research Status

Problem                                   | Example
------------------------------------------+-----------------------------------
Assumptions should be employed in order   | Shape from texture algorithms
to solve a problem. Usually, these        | [Witkin, 1981; Aloimonos, 1986].
assumptions are not general enough.       |
------------------------------------------+-----------------------------------
If the constraints do not guarantee       | Shape from shading, optic flow
uniqueness, assumptions should be         | from image sequences [Horn, 1980].
employed to bring down the space of all   |
solutions to a unique value. Usually the  |
assumptions are not present in the world. |
------------------------------------------+-----------------------------------
Even if an algorithm is proved to have a  | 3-D motion from optic flow [Tsai
unique solution, the resulting algorithm  | and Huang, 1984], image
might be unstable.                        | reconstruction from zero-crossings
                                          | and gradients Hummel,
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The  problem  of  theoretical  stability  analysis  in  visual 
algorithms  is  a  difficult  one  because  distributions  of  the  image 
intensity function are necessary for such a task. Whenever
possible, we perform a stability analysis of the introduced
algorithms, but we haven't been able to present a unified
stability analysis. There is such work in progress [Aloimonos
and Spetsakis, 1987].


3.2.  The  Regularization  Paradigm 

One of the best definitions of early vision is that it is the
inverse of optics, i.e., a set of computational problems that
both machines and biological organisms have to solve. While
in classical optics the problem is to determine the images of
physical objects, vision is confronted with the inverse problem
of determining properties of the 3-dimensional world from the
light  distribution  in  an  image,  or  a  dynamic  sequence  of 
images.  In  1923,  Hadamard  defined  a  mathematical  problem 
to  be  well-posed  when  its  solution: 

a)  exists, 

b)  is  unique, 

c) depends continuously on the initial data (is robust
against noise).

Most  of  the  problems  in  classical  physics  are  well  posed,  and 
Hadamard  argued  that  physical  problems  had  to  be  well- 
posed.  However,  it  seems  that  inverse  problems  are  usually 
ill-posed. Consider, for example, the equation $y = Ax$, where
$A$ is a known operator. This equation can represent optics,
where $y$ is the image, $A$ is the imaging process, and $x$ is the
world. So, in this case, the problem is to determine $y$ from $x$.
The inverse problem, i.e. find $x$ from $y$, which is the problem
of computer vision, is usually ill-posed when $x, y$ belong to a
Hilbert space.
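
The loss of continuous dependence on the data, condition (c) above, can be seen in a small numerical sketch. The example is not from the paper: it assumes a hypothetical one-dimensional imaging operator $A$ (a Gaussian blur matrix built by the helper `blur_matrix`), a made-up signal standing in for the world $x$, and NumPy. It only illustrates that a tiny perturbation of $y$ can change the naively inverted $x$ by many orders of magnitude more.

```python
import numpy as np


def blur_matrix(n, sigma=2.0):
    """Hypothetical forward operator A: each output sample is a
    Gaussian-weighted average of the input (rows sum to 1)."""
    idx = np.arange(n)
    A = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / sigma) ** 2)
    return A / A.sum(axis=1, keepdims=True)


def naive_inverse_error(n=40, noise=1e-6, seed=0):
    """Invert y = A x directly after perturbing y slightly.

    Returns (relative error in y, relative error in the recovered x).
    """
    rng = np.random.default_rng(seed)
    A = blur_matrix(n)
    x_true = np.sin(np.linspace(0.0, np.pi, n))    # made-up "world"
    y = A @ x_true                                 # forward problem
    y_noisy = y + noise * rng.standard_normal(n)   # slightly perturbed image
    x_rec = np.linalg.solve(A, y_noisy)            # naive inversion
    data_err = np.linalg.norm(y_noisy - y) / np.linalg.norm(y)
    sol_err = np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true)
    return data_err, sol_err


data_err, sol_err = naive_inverse_error()
print(f"relative error in y: {data_err:.2e}")
print(f"relative error in recovered x: {sol_err:.2e}")
```

The blur width, signal and noise level are arbitrary choices; any strongly smoothing operator would show the same qualitative behavior.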

Most of the early vision problems are ill-posed (shape
from shading, texture, contour, optic flow from image brightness
and the like). Rigorous regularization theories for solving
ill-posed problems have been developed during the past years
[Tichonov and Arsenin, 1977]. The basic idea of regularization
techniques is to restrict the space of acceptable solutions
by choosing the function that minimizes an appropriate functional.
The regularization of the ill-posed problem of finding
$x$ from $y$ such that $y = Ax$ requires the choice of norms $\|\cdot\|$
and of a stabilizing functional $\|Px\|$. Of course this choice is
dictated by mathematical considerations and, most importantly,
by a physical analysis of the generic constraints of the
problem. Then, several methods can be applied, as for example,
find $x$ that minimizes $\|Ax - y\|^2 + \lambda \|Px\|^2$, where $\lambda$ is
the so-called regularization parameter, or among $x$ that
satisfy $\|Px\| < k$, where $k$ is a constant, find $x$ that makes
$\|Ax - y\|$ minimum, etc. Basically, in the regularization
paradigm, a low-level vision problem that is ill-posed has to
be regularized by imposing additional constraints which
should be physically plausible. An example of such an additional
constraint is some kind of smoothness of the unknown
functions. There has been excellent research in regularizing
early vision problems, originated in [Poggio and Koch, 1985].
Even though the regularization paradigm is very attractive
for its mathematical elegance and for being a legitimization of
already published research [Horn and Schunck, 1981; Ikeuchi
and Horn, 1981; Hildreth, 1981], it has some shortcomings, in
the sense that it cannot deal with the full complexity of
vision. One problem is the degree of smoothness required for
the unknown function that has to be recovered; for example,
some unrealistic results have been reported in surface interpolation,
because depth discontinuities are smoothed too much.
Research on regularization in the presence of discontinuities,
while pioneering, is still premature [Terzopoulos, 1986; Lee
and Pavlidis, 1987; Shulman and Aloimonos, 1987; Nagel,
1984]. Another problem is that standard regularization
theory deals with linear problems and is based on quadratic
stabilizers. In the case of nonquadratic functionals standard
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regularization  theory  may  be  used,  but  the  situation  is  prob¬ 
lematic  [Morozov,  1984].  For  nonquadratic  functionals,  the 
search  space  may  have  many  local  minima  and  in  this  case 
only stochastic algorithms by Kirkpatrick described in [Poggio,
1985] might have some success. Thus, it will take the
development  of  a  rigorous  theory  of  discontinuous  regulariza¬ 
tion  to  justify  the  possibility  of  the  regularization  theory  serv¬ 
ing  as  a  general  theory  of  many  low-level  vision  problems. 
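
For the linear, quadratic case discussed in this section, minimizing $\|Ax - y\|^2 + \lambda \|Px\|^2$ reduces to solving the normal equations $(A^T A + \lambda P^T P)x = A^T y$. The sketch below is a minimal illustration under assumptions that are not in the paper: a hypothetical Gaussian blur for $A$, a first-difference operator for the stabilizer $P$ (a simple smoothness constraint), and hand-picked values for the noise level and $\lambda$. The function names are ours.

```python
import numpy as np


def tikhonov_solve(A, y, P, lam):
    """Minimize ||A x - y||^2 + lam * ||P x||^2 by solving the normal
    equations (A^T A + lam P^T P) x = A^T y (linear A, quadratic stabilizer)."""
    lhs = A.T @ A + lam * (P.T @ P)
    rhs = A.T @ y
    return np.linalg.solve(lhs, rhs)


def first_difference(n):
    """Assumed stabilizer P: (P x)_i = x_{i+1} - x_i, penalizing rough x."""
    return np.diff(np.eye(n), axis=0)


def regularization_demo(n=40, sigma=2.0, noise=1e-3, lam=1e-3, seed=0):
    """Return the relative errors of (naive inversion, regularized solution)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    A = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / sigma) ** 2)
    A /= A.sum(axis=1, keepdims=True)              # hypothetical blur operator
    x_true = np.sin(np.linspace(0.0, np.pi, n))    # made-up smooth signal
    y = A @ x_true + noise * rng.standard_normal(n)

    x_naive = np.linalg.solve(A, y)
    x_reg = tikhonov_solve(A, y, first_difference(n), lam)

    def rel_err(x):
        return np.linalg.norm(x - x_true) / np.linalg.norm(x_true)

    return rel_err(x_naive), rel_err(x_reg)


err_naive, err_reg = regularization_demo()
print(f"naive inversion error:  {err_naive:.2e}")
print(f"regularized error:      {err_reg:.2e}")
```

Because the stabilizer penalizes every jump in $x$, the same code applied to a signal with a step would smooth the step, which is the shortcoming at discontinuities noted above.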

3.3.  Results:  More  Information  from  Cooperative 
Sources  Yields  Unique  and  Reliable  Solutions 

Looking back at the current research status diagram of
Figure 2.3, we see that from a particular cue a particular
intrinsic property is computed. That is, no cues are combined,
in most of the published work, to recover an intrinsic
image. As we have already seen, this has as a result the fact
that several computations do not have uniqueness properties
(and so additional assumptions are needed about the world)
and several computations that have uniqueness properties
under ideal conditions break down in the presence of a small
amount of noise. In order to take care of these problems,
more information is needed. In particular, if we combine
information from the different image cues, then several
computations that did not have uniqueness properties might now
have them, simply because the unknown parameters are subject
to more constraints that guarantee uniqueness, and
several computations which even though they had uniqueness
properties were very unstable are now robust, simply because
the additional constraints do not let the solution escape from
its actual position. The proposed framework for the computation
of intrinsic images is given in the following Figure 3.1.
The  reader  should  compare  this  with  Figure  2.3  of  Section  2.3 
to  realize  that  new  information  is  combined  from  different 
cues  to  recover  the  intrinsic  parameters. 
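
The claim that extra constraints can restore uniqueness has a simple linear analogue, sketched below. This is only an analogy and not one of the constraints derived in the paper: it assumes that each cue contributes a linear equation on the same two unknowns, and the matrices `C_cue1` and `C_cue2` are made-up values.

```python
import numpy as np


def solution_is_unique(C):
    """A consistent linear system C p = d has exactly one solution
    when C has full column rank (trivial null space)."""
    return np.linalg.matrix_rank(C) == C.shape[1]


# Two unknowns, e.g. the two components of a local surface gradient.
C_cue1 = np.array([[1.0, 2.0]])     # hypothetical constraint from one cue
C_cue2 = np.array([[3.0, -1.0]])    # hypothetical constraint from another cue
C_both = np.vstack([C_cue1, C_cue2])

p_true = np.array([0.5, -0.25])     # made-up ground truth
d_both = C_both @ p_true            # measurements implied by p_true

# With both cues the least-squares solution recovers p_true.
p_rec, *_ = np.linalg.lstsq(C_both, d_both, rcond=None)

print(solution_is_unique(C_cue1))   # one cue: a whole line of solutions
print(solution_is_unique(C_both))   # two cues: a single point
print(p_rec)
```

The constraints studied in the later sections are generally nonlinear, so this rank test is a caricature; it only shows why adding independent equations shrinks the solution set.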

It  is  worth  noting  that  very  recently  a  few  researchers 
have  realized  the  need  for  combination  of  information  from 
different image cues for better estimation of intrinsic parameters.
In particular, there is the work of Horn [1986] for combining
shading with contour, the work of Waxman et al.
[Waxman et al., 1986] for combining stereo and motion, the
work of Grimson [Grimson, 1984] for combining shading and
stereo, the work of Richards and Huang and Blostein for
stereo and motion [Richards, 1985; Huang and Blostein, 1985],
and the work of Milenkovich and Kanade [Milenkovich et al.,
1985]. So the need for such an approach has already been
realized by some, and the goal of this work is to contribute to
a better understanding of this approach, in the hope that it will
generate more related research.

The basic structure of the paper is depicted in the following
diagram (Figure 3.2). In the ellipses (top) are the
different image cues (we take the liberty to call stereo or
motion a cue). It is obvious that by cue we mean a source of
information, either coming from the image(s) or from the particular
setup or condition of the visual system (stereo-motion).
In  the  squares  are  the  results  we  obtain  (in  terms  of  proposi¬ 
tions)  when  we  combine  information  from  two  different  cues. 
Two  or  more  different  cues  are  combined  with  arcs  which  lead 
to  small  circles  containing  a  plus.  Then,  a  different  arc  from 
the  plus  leads  to  a  square  containing  the  result  from  this  com¬ 
bination.  The  numbers  at  a  plus  or  an  arc  indicate  that  the 
theory  for  this  particular  computation  can  be  found  in  the 
corresponding section. There are more combinations that one
can make and the corresponding results may be found in
[Aloimonos, 1986]. Also in the table there are some results
that are not stated in the paper but they are easily inferable.

3.4.  Mathematical  Preliminaries  and  What  is 
to  Come 

It  is  very  important  to  understand  how  the  images  are 
formed,  because  this  is  a  prerequisite  for  being  able  to  extract 
information  from  images.  There  are  basically  two  questions 
about  image  formation: 

a)  What  determines  where  the  image  of  some  point  will 
appear? 

b) What determines how bright the image of some surface
will be?

Agreeing  that  it  is  very  important  to  know  how  an  image  is 
formed in order to analyze it, we have to study two things:
First, we need to find the geometric correspondence between
points in the scene and points in the image, and second, we
must find out what determines the brightness at a particular
point in the image. Assuming that the reader is familiar with
the concepts of perspective and orthographic projection as
well as the meaning and representations of shape, depth, retinal
and 3-D motion, reflectance models, and the like, we
describe them in Appendix A.
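
As a concrete reminder of the two projection models named above, here is a minimal sketch. It assumes a camera-centered coordinate system with the optical axis along Z and a focal length `f`; these conventions and the function names are illustrative assumptions and may differ from those of Appendix A.

```python
import numpy as np


def perspective_project(points, f=1.0):
    """Perspective projection: (X, Y, Z) -> (f X / Z, f Y / Z).
    Points must lie in front of the camera (Z > 0)."""
    points = np.asarray(points, dtype=float)
    if np.any(points[:, 2] <= 0):
        raise ValueError("points must have Z > 0")
    return f * points[:, :2] / points[:, 2:3]


def orthographic_project(points):
    """Orthographic projection: (X, Y, Z) -> (X, Y); depth is discarded."""
    points = np.asarray(points, dtype=float)
    return points[:, :2].copy()


# Two scene points with the same (X, Y) but different depth.
pts = np.array([[1.0, 2.0, 4.0],
                [1.0, 2.0, 8.0]])
persp = perspective_project(pts, f=2.0)
ortho = orthographic_project(pts)
print(persp)   # the farther point projects closer to the image center
print(ortho)   # both points project to the same image location
```

Under orthography the image position carries no depth information at all, which is one reason the shading analysis of Section 4 is simpler in that model.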

Once  again,  this  paper  does  not  try  to  present  a  unified 
theory  for  the  computation  of  the  intrinsic  images.  Much 
more  research  is  required  for  that,  and  the  last  section  sheds 
some  light  on  this  issue.  Instead,  it  proves  that  if  several  cues 
are  combined  and  if  the  right  (natural)  assumptions  are 
employed,  then  we  can  obtain  visual  computations  which 
uniquely and robustly compute intrinsic images. We address
the problems of shape from shading, shape from texture, and
shape and 3-D motion from contour. The first three problems
are ill-posed while the last one has been shown to be very difficult
in a practical situation (unstable) [Tsai and Huang, 1984].
Section 4 shows that if shading is combined with motion, then
theoretically a unique solution is attainable. Section 5 shows
that if texture is combined with motion then again a unique
solution is guaranteed. Section 6 analyzes the problem of
shape from contour and proves that if the information from
the contour is combined with stereo, then uniqueness is
guaranteed. Also, it is shown how to recover the 3-D motion
of a planar contour without any correspondence, in the
discrete case. Later sections analyze some algorithms from
the rest of the cues and the paper concludes with a summary
and some of our future work.

4.  SHAPE  FROM  SHADING  AND  MOTION 

In  this  section  we  prove  that  if  we  combine  shading  with 
motion, then we can uniquely compute the direction of the
light source and the shape of the object in view. In particular:


(1) We develop a constraint between retinal motion displacements,
local shape and the direction of the light source.
It is worth noting that this constraint does not involve
the albedo of the imaged surface. This constraint is of
importance on its own, and it can be used in related
research in computer or human vision.

(2) We develop a constraint between retinal displacements
and local shape. Again, this constraint is important on
its own, and it is the heart of the algorithm presented
later in this section.

(3) We present algorithms for the unique computation of the
lighting direction and the shape of the object in view.

(4) And we present some experimental results that test the
theory.

The  basic  assumption  in  this  section  is  that  the  retinal 
motion  is  computed  everywhere  in  the  image,  in  the  case  of  a 
moving  observer  and  a  stationary  scene,  or  a  stationary 
observer  and  a  moving  object.  If  several  objects  are  moving 
in the scene, then a segmentation is required first, i.e. the
algorithms  developed  here  can  be  applied  to  one  rigidly  mov¬ 
ing  object. 

Shading  is  important  for  the  estimation  of  three- 
dimensional  shape  from  two  dimensional  images,  for  instance 
for distinguishing between the smooth occluding contour generated
by the edge of a sphere and the sharp occluding contour
generated by the edge of a disc. In order to successfully
use shading, one must know the illuminant direction $\mathbf{l}$. This
is  because  variations  in  image  intensity  (shading)  are  caused 
by  changes  in  surface  orientation  relative  to  the  illuminant. 
This  section  reviews  previous  approaches  to  the  solution  of 
the  determination  of  the  illuminant  direction  problem  and 
presents  a  new  method  for  the  unambiguous  determination  of 
the  single  lighting  direction,  and  from  that  of  the  shape  infor¬ 
mation.  In  particular,  this  part  of  the  paper  shows  that  if  we 
combine  information  from  shading  and  motion,  then  we  can 
uniquely  compute  shape  and  the  illuminant  direction.  Finally, 
in  this  section  we  are  making  the  assumption  of  orthographic 
projection,  since  reflectance  equation  models  are  not  known 
up  to  this  point  under  perspective  projection.* 

4.1.  Prerequisites 

The  ability  to  obtain  three  dimensional  shape  from  two 
dimensional  intensity  images  is  an  important  part  of  vision. 
The  human  visual  system  in  particular  is  able  to  use  shading 
cues  to  infer  changes  in  surface  orientation  fairly  accurately, 
with or without the aid of texture or surface markings. An
example in which shading information is important is the
change in luminance that distinguishes a smooth occluding
contour (such as that generated by the edge of a sphere) from
a sharp occluding contour (such as that generated by the edge
of a disc).

The direction of illumination is required to be known in
order to obtain accurate three-dimensional surface shape from
two dimensional shading because changes in image intensity
are primarily a function of changes in surface orientation relative
to the illuminant. For example, small changes in surface
orientation  parallel  to  the  illuminant  direction  can  cause  large 
changes  in  image  intensity,  whereas  large  changes  in  surface 
orientation  that  occur  in  a  direction  perpendicular  to  the 

*But recent research shows that the model might be the same
for perspective [Shafer, S., private communication, 1987].
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direction  of  illumination  will  not  change  image  intensity  at 
all. So, the illuminant direction must be known before one
can determine what a particular change in image intensity
implies about changes in surface orientation. In this section,
we  develop  a  computational  theory  for  the  determination  of 
the  illuminant  direction,  and  the  shape  of  the  object  in  view, 
from  two  images  of  a  moving  object  (or  from  two  images 
taken  by  a  moving  observer).  Before  we  proceed,  we  should 
discuss  a  little  about  image  formation,  even  though  this  was 
discussed  in  Section  2,  in  general  terms. 

4.2.  Process  of  Image  Formation 

In order to be able to make quantitative statements
about the world and the image, and specifically to estimate
the illuminant direction, we must use a mathematical model
for image formation. A great deal of work has been done
in this area [Horn et al., 1975, 1979] and many models have
been developed. For the purposes of this section, we use the
following simple and universally accepted model (see Figure
4.1). Assuming orthographic projection, if $\mathbf{n}$ is the surface
normal at a point on the imaged surface, $\mathbf{l}$ is the illuminant
direction and $\lambda$ is the flux emitted towards the surface, and we
assume a Lambertian reflectance function for the surface
[Horn et al., 1975, 1979], the image intensity is given by

$I = \rho\,\lambda\,(\mathbf{n} \cdot \mathbf{l})$

where $\rho$ is the albedo of the surface, a constant depending on
the surface. For more on the subject, see Appendix A.
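
As a small illustration of the image formation model, the following Python sketch evaluates the Lambertian intensity for a given surface normal and illuminant direction. The function name, the default values, and the clamping of negative values to zero (a patch facing away from the light) are assumptions made for this example; they are not part of the model as stated above.

```python
import math


def lambertian_intensity(normal, light, albedo=1.0, flux=1.0):
    """Return albedo * flux * (n . l), with n and l normalized to unit length."""
    n_len = math.sqrt(sum(c * c for c in normal))
    l_len = math.sqrt(sum(c * c for c in light))
    cos_incidence = sum(a * b for a, b in zip(normal, light)) / (n_len * l_len)
    # Assumption: a patch facing away from the light receives no illumination.
    return albedo * flux * max(cos_incidence, 0.0)
```

For a patch facing the light head-on the intensity equals albedo times flux, and it falls to zero as the normal becomes perpendicular to the illuminant direction.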


Figure  4.1:  Process  of  image  formation. 


4.3.1.  Motivation  and  Previous  Work 

Despite the fact that the problem of determining the
illuminant direction is important for computer vision (shape
from shading), not much work has been done towards its
solution.

We stress here that the determination of the illuminant
direction is important. Most of the work in shape from
shading [Horn, 1975; Strat, 1979; Ikeuchi et al., 1981]
assumes that the albedo of the surface in view
and the illuminant direction are known a priori; in other
words, this work assumes that the reflectance map is known.
The reflectance map specifies how the brightness of a surface
patch depends on its orientation under given circumstances.
It therefore encodes information about the reflecting
properties of the surface and information about the
distribution and intensity of the light sources. In fact, the
reflectance map can be computed from the bidirectional
reflectance-distribution function and the light source
arrangement, as shown by [Horn and Sjoberg, 1979].

When encountering a new scene, we usually do not have
the information required to determine the reflectance map.
Yet, without this information, we are unable to formulate the
shape from shading problem, much less solve it.

The dilemma may be resolved if a calibration object of
known shape appears in the scene, since the reflectance map
can be computed from its image; but what happens when we
are not that fortunate? It is evident from the above discussion
that at least the knowledge of the illuminant direction is
required. The only work done on the determination of the
illuminant direction is due to [Pentland, 1982, 1984; Brooks, 1985;
and Brown et al., 1983]. Pentland's method is based on the
assumption that surface orientation, when considered as a
random variable over all possible scenes, is isotropically
distributed. A consequence of this assumption is that the change
of surface normals is also isotropically distributed. Pentland's
method, which uses the same model of image formation that
we do, is valid for some objects. Under his assumptions,
Pentland solves the problem uniquely, but his assumptions
are very restrictive.

On the other hand, Brooks and Horn [1985] presented a
method in the general framework of ill-posed problems and
regularization in early vision. Their theory proposes to solve
the shape from shading problem and at the same time to
compute the illuminant direction, by minimization of an
appropriate functional. They did not present any uniqueness
or convergence proofs for their iterative methods, but their
experimental results for synthetic images were reasonable.

Finally, [Brown et al., 1983] presented a method that,
based on Lambertian reflectance and a Hough transform
technique, attempted the recovery of the direction of the light
source.


In  the  sequel,  we  prove  that  the  illuminant  direction  can¬ 
not  be  recovered  from  only  one  intensity  image  of  a  Lamber¬ 
tian  surface.  After  this,  we  will  prove  that  two  intensity 
images  (moving  object  or  moving  observer),  with  the 
correspondence  between  them  established,  can  uniquely 
recover  the  illuminant  direction,  and  from  that  the  shape. 

4.4.  A  Uniqueness  Result 

Here  we  prove  the  following  theorem. 


THEOREM 1: Given an image (i.e. an intensity function
$I(x,y)$), there are an infinite number of surfaces and an
infinite number of positions of the light source that will
produce the same image under the process of image formation
described in Section 4.2.


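Proof: Suppose that for a shape $\mathbf{n}_1(x,y)$, $(x,y) \in \Omega$ ($\Omega$ is the
domain where the image function is defined) and a light
source position $\mathbf{s}_1$ we have $I(x,y) = \rho\,\mathbf{n}_1(x,y) \cdot \mathbf{s}_1$, where $\rho$ is
the albedo of the surface in view (considered constant
everywhere). Define a shape $\mathbf{n}_2$ over $\Omega$ and a light source
position $\mathbf{s}_2$ as follows:

$\mathbf{n}_2(x,y) = 2\mathbf{m}\,(\mathbf{n}_1(x,y) \cdot \mathbf{m}) - \mathbf{n}_1(x,y), \qquad \mathbf{s}_2 = 2\mathbf{m}\,(\mathbf{s}_1 \cdot \mathbf{m}) - \mathbf{s}_1$

for any vector $\mathbf{m}$ with $\|\mathbf{m}\| = 1$.

Then, considering a surface with the same albedo as
before and with shape $\mathbf{n}_2(x,y)$, illuminated from a point
source in the direction $\mathbf{s}_2$, we have

$$
\begin{aligned}
\rho\,\mathbf{n}_2(x,y) \cdot \mathbf{s}_2
&= \rho\,\bigl(2\mathbf{m}(\mathbf{n}_1(x,y) \cdot \mathbf{m}) - \mathbf{n}_1(x,y)\bigr) \cdot \bigl(2\mathbf{m}(\mathbf{s}_1 \cdot \mathbf{m}) - \mathbf{s}_1\bigr) \\
&= \rho\,\bigl[4(\mathbf{m} \cdot \mathbf{m})(\mathbf{n}_1(x,y) \cdot \mathbf{m})(\mathbf{s}_1 \cdot \mathbf{m}) - 2(\mathbf{n}_1(x,y) \cdot \mathbf{m})(\mathbf{s}_1 \cdot \mathbf{m}) \\
&\qquad - 2(\mathbf{s}_1 \cdot \mathbf{m})(\mathbf{n}_1(x,y) \cdot \mathbf{m}) + \mathbf{n}_1(x,y) \cdot \mathbf{s}_1\bigr] \\
&= \rho\,\mathbf{n}_1(x,y) \cdot \mathbf{s}_1
\end{aligned}
$$

So, $I(x,y) = \rho\,\mathbf{n}_2(x,y) \cdot \mathbf{s}_2$.

This means that the image $I(x,y)$ could be due to an
infinite number of surfaces illuminated from one of an infinite
number of light sources, since the vector $\mathbf{m}$ can be arbitrary.
(q.e.d.)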

The importance of the above theorem is that no correct
and robust method can exist that will find the illuminant
direction from one intensity image of a Lambertian surface
illuminated from a point source.
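
The ambiguity of Theorem 1 can be checked numerically. The following Python sketch, given only as an illustration, draws random unit normals, light directions and unit vectors m, applies the map v -> 2m(v . m) - v from the proof to both the normal and the light source, and reports the largest change in the dot product. All function names and the trial count are assumptions of this example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def reflect(v, m):
    """Return 2 m (v . m) - v, the map used in the proof (m must be a unit vector)."""
    s = 2.0 * dot(v, m)
    return tuple(s * mi - vi for mi, vi in zip(m, v))


def max_intensity_gap(trials=100, seed=0):
    """Largest |n1 . s1 - n2 . s2| over random normals, lights and unit vectors m."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        n1 = unit([rng.gauss(0, 1) for _ in range(3)])
        s1 = unit([rng.gauss(0, 1) for _ in range(3)])
        m = unit([rng.gauss(0, 1) for _ in range(3)])
        n2, s2 = reflect(n1, m), reflect(s1, m)
        worst = max(worst, abs(dot(n1, s1) - dot(n2, s2)))
    return worst
```

The gap stays at the level of floating-point rounding for every m, which is the content of the theorem: the pair (n2, s2) produces the same Lambertian image as (n1, s1).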

We  now  move  to  the  main  part  of  this  section,  that  is  a 
theorem  that  states  that  given  two  images  of  a  moving  object 
(or  two  images  of  the  same  object  taken  by  a  moving 
observer),  with  the  correspondence  between  the  two  images 
established,  the  position  of  the  light  source  can  be  uniquely 
determined.  But  before  that,  we  need  some  technical  prere¬ 
quisites  that  are  presented  in  the  following  section. 


4.5.  Technical  Prerequisites 

In  this  section  we  develop  two  technical  results,  one  con¬ 
cerning  the  relationship  between  shape,  intensity,  displace¬ 
ments  and  the  lighting  direction  and  the  other  concerning  the 
parameters  of  a  small  motion  (small  rotation)  with  the  shape. 
We  begin  with  the  following  theorem. 


THEOREM 2: Suppose that two views (rigid motion) of the
same (Lambertian) surface (locally planar) are given and let $I_1$
and $I_2$ be the two intensity functions. Suppose also that the
displacement vector field $(u(x,y), v(x,y))$, $(x,y) \in \Omega$, is known,
where $\Omega$ is the domain of the image, i.e. a point $(x,y)$ in the
first image will move to the point $(x + u(x,y),\ y + v(x,y))$ in
the second image. If the lighting direction is the unit vector
$\mathbf{l} = (l_1, l_2, l_3)$ and the gradient of a surface point whose image
is the point $(x,y)$ is $(p,q)$, the following relation holds:

$$
\begin{aligned}
& p^2\bigl[(l_2\Delta_y u - l_1(1+\Delta_y v))^2 - r^2 l_1^2\bigr] \\
&+ 2pq\bigl[(l_1\Delta_x v - l_2(1+\Delta_x u))(l_2\Delta_y u - l_1(1+\Delta_y v)) - r^2 l_1 l_2\bigr] \\
&+ q^2\bigl[(l_1\Delta_x v - l_2(1+\Delta_x u))^2 - r^2 l_2^2\bigr] \\
&- 2p\,l_1 l_3\,r\bigl(r - ((1+\Delta_x u)(1+\Delta_y v) - \Delta_y u\,\Delta_x v)\bigr) \\
&- 2q\,l_2 l_3\,r\bigl(r - ((1+\Delta_x u)(1+\Delta_y v) - \Delta_y u\,\Delta_x v)\bigr) \\
&- \bigl((1+\Delta_x u)(1+\Delta_y v) - \Delta_y u\,\Delta_x v\bigr)^2 \\
&+ l_1^2\bigl((\Delta_x v)^2 + (1+\Delta_y v)^2\bigr) + l_2^2\bigl((\Delta_y u)^2 + (1+\Delta_x u)^2\bigr) \\
&- 2 l_1 l_2\bigl((1+\Delta_x u)\Delta_x v + \Delta_y u(1+\Delta_y v)\bigr) \\
&- r^2 l_3^2 + 2 r\,l_3^2\bigl((1+\Delta_x u)(1+\Delta_y v) - \Delta_y u\,\Delta_x v\bigr) = 0 \qquad (4.1)
\end{aligned}
$$

where

$r = \dfrac{I_2(x + u(x,y),\ y + v(x,y))}{I_1(x,y)}$

and

$\Delta_x u = u(x+1, y) - u(x,y)$
$\Delta_y u = u(x, y+1) - u(x,y)$
$\Delta_x v = v(x+1, y) - v(x,y)$
$\Delta_y v = v(x, y+1) - v(x,y)$

It is clear that the above equation (4.1) is local, i.e. it involves
the gradient at a point $(x,y)$ and the displacements around the
point $(x,y)$, along with the global direction of lighting. The
above constraint is a conic in $p, q$ of the form

$A p^2 + B q^2 + C pq + D p + E q + F = 0,$

where the coefficients depend on local displacements and the
light source direction. From now on we will refer to equation
(4.1) as the lighting constraint.


Proof: To exploit the rigid motion assumption, we represent
the surface normal by two vectors and note that their lengths,
angular separation and hence their dot and cross products are
preserved by rigid motion. Consider the surface $S$, a point $A$
on $S$, the vector $\mathbf{n} = p\,\mathbf{i} + q\,\mathbf{j} + \mathbf{k}$ perpendicular to $S$ at the
point $A$, and the plane $\Pi$ that is tangent to the surface $S$ at
the point $A$ (see Figure 4.2). The vectors $\mathbf{a} = (1, 0, -p)$ and
$\mathbf{b} = (0, 1, -q)$ lie in $\Pi$ and

$\mathbf{a} \times \mathbf{b} = p\,\mathbf{i} + q\,\mathbf{j} + \mathbf{k} = \mathbf{n}$


Figure  4.2:  Shape  vectors. 


We  shall  use  vectors  a  and  b  as  the  shape  (surface  nor¬ 
mal)  representation  (Figure  4.3).  We  use  the  following  tradi¬ 
tional  camera  model.  Let  O  be  the  position  of  the  nodal 
point  of  the  eye,  let  OXYZ  be  a  coordinate  system  that  is 
fixed  with  respect  to  the  eye,  and  let  OZ  be  the  line  of  sight. 
Finally,  let  the  image  plane  be  perpendicular  to  the  Z-axis  at 
the  point  (0,0,1). 


Figure 4.3: Displacement vectors.
Consider a point $A(x,y)$ on the surface $S$ at time $t$
whose shape vectors are $\mathbf{a} = \mathrm{vec}(AB) = \mathbf{i} - p\,\mathbf{k}$ and
$\mathbf{b} = \mathrm{vec}(AC) = \mathbf{j} - q\,\mathbf{k}$ ($\mathrm{vec}(AB)$ means the vector from the
point $A$ to the point $B$). Since the projection of $\mathbf{a} = \mathrm{vec}(AB)$
on the image plane is $\mathbf{i}$, we conclude that if $I_A = (x,y)$ is the
projection of $A$ then the projection of $B$ will be
$I_B = (x+1, y)$. Similarly, the projection of $C$ will be the
point $I_C = (x, y+1)$ on the image plane. Consider now the
object at the next frame, where the point $A$ will become $A'$,
with shape vectors $\mathbf{a}' = \mathrm{vec}(A'B')$ and $\mathbf{b}' = \mathrm{vec}(A'C')$.

Let $I_{A'}$ be the projection of $A'$. $I_{A'}$ is the position to
which $I_A$ moves, which can be determined from the
displacements. Thus, $I_{A'} = (x + u(x,y),\ y + v(x,y))$. Similarly, if $I_{B'}$
and $I_{C'}$ are the projections of $B'$ and $C'$, then $I_{B'}$ is the
position to which $I_B$ will move. Of course, the displacement at $I_B$
is due to the motion of the surface point that has the same
$x, y$ coordinates as $B$ (orthography), but because of the
assumed local planarity, this point is the same as $B$. (The
planarity constraint of course fails at boundaries.) So, we
have $I_{B'} = (x + 1 + u(x+1, y),\ y + v(x+1, y))$, and for the
same reasons $I_{C'} = (x + u(x, y+1),\ y + 1 + v(x, y+1))$. The
projection of $\mathbf{a}' = \mathrm{vec}(A'B')$ on the image plane is thus

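$I_{A'}I_{B'} = [1 + u(x+1, y) - u(x,y)]\,\mathbf{i} + [v(x+1, y) - v(x,y)]\,\mathbf{j}$

But with our notation, the above relation can be
written

$I_{A'}I_{B'} = (1 + \Delta_x u)\,\mathbf{i} + \Delta_x v\,\mathbf{j}$

Similarly, we get

$I_{A'}I_{C'} = \Delta_y u\,\mathbf{i} + (1 + \Delta_y v)\,\mathbf{j}$

The above two equations give us the expressions for $I_{A'}I_{B'}$ and
$I_{A'}I_{C'}$, which are the projections of the shape vectors $\mathbf{a}'$ and $\mathbf{b}'$
respectively. But then

$\mathbf{a}' = (1 + \Delta_x u)\,\mathbf{i} + \Delta_x v\,\mathbf{j} + \lambda\,\mathbf{k}$ and

$\mathbf{b}' = \Delta_y u\,\mathbf{i} + (1 + \Delta_y v)\,\mathbf{j} + \mu\,\mathbf{k}$, where $\lambda, \mu$ are to be determined.

But since rigid body motion preserves the vector length, we
have

$\lambda = \pm\bigl(p^2 - (\Delta_x u)^2 - (\Delta_x v)^2 - 2\Delta_x u\bigr)^{1/2}$

Similarly, we get

$\mu = \pm\bigl(q^2 - (\Delta_y u)^2 - (\Delta_y v)^2 - 2\Delta_y v\bigr)^{1/2}$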
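Assuming that neither region is in shadow, we have

$I_1(x,y) = \rho_1\,\mathbf{n}_A \cdot \mathbf{l}$   (4.2)

and

$I_2(x + u(x,y),\ y + v(x,y)) = \rho_2\,\mathbf{n}_{A'} \cdot \mathbf{l}$   (4.3)

Equation (4.2) gives the intensity of the point $A(x,y)$ in the
first frame. Note that $\mathbf{n}_A$ is the surface normal at the point
$A$ (in the first view). Equation (4.3) gives the intensity at the
point $A'(x + u(x,y),\ y + v(x,y))$ in the second frame. Note
that $\mathbf{n}_{A'}$ is the surface normal at the new position of the point
$A$.

Dividing equation (4.3) by equation (4.2), setting
$I_2(A') / I_1(A) = r$ and taking into account that $\rho_1 = \rho_2$
(surface markings do not change), we get

$r\,\mathbf{n}_A \cdot \mathbf{l} = \mathbf{n}_{A'} \cdot \mathbf{l}$   (4.4)

But

$\mathbf{n}_A = \dfrac{\mathbf{a} \times \mathbf{b}}{\|\mathbf{a} \times \mathbf{b}\|}$   (4.5)

and from the rigidity of the motion it follows that

$\mathbf{n}_{A'} = \dfrac{\mathbf{a}' \times \mathbf{b}'}{\|\mathbf{a}' \times \mathbf{b}'\|}$   (4.6)

$\|\mathbf{a} \times \mathbf{b}\| = \|\mathbf{a}' \times \mathbf{b}'\|$   (4.7)

Using equation (4.7), equation (4.4) becomes

$r\,[\mathbf{a}, \mathbf{b}, \mathbf{l}] = \pm[\mathbf{a}', \mathbf{b}', \mathbf{l}]$   (4.8)

where the sign chosen is the sign of $[\mathbf{a}', \mathbf{b}', \mathbf{k}]$ and $[\cdot, \cdot, \cdot]$ is the
triple scalar product of vectors. But

$[\mathbf{a}', \mathbf{b}', \mathbf{k}] = (1 + \Delta_x u)(1 + \Delta_y v) - \Delta_y u\,\Delta_x v.$

It is obvious that $[\mathbf{a}, \mathbf{b}, \mathbf{k}] > 0$; if $[\mathbf{a}', \mathbf{b}', \mathbf{k}] < 0$, then we
do not have a valid motion, because $[\mathbf{a}', \mathbf{b}', \mathbf{k}] < 0$ means that
we have reversed orientation, so that the texture in the image
is viewed as if seen in a mirror. So, $[\mathbf{a}', \mathbf{b}', \mathbf{k}] > 0$, and
substituting in equation (4.8) the values of $\mathbf{a}, \mathbf{b}, \mathbf{a}', \mathbf{b}'$, after
algebraic manipulations and using the fact that

$\lambda^2 = p^2 + 1 - (1 + \Delta_x u)^2 - (\Delta_x v)^2$

$\mu^2 = q^2 + 1 - (1 + \Delta_y v)^2 - (\Delta_y u)^2$

$\lambda\mu = pq - (1 + \Delta_x u)\Delta_y u - (1 + \Delta_y v)\Delta_x v$

we get equation (4.1).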

It is obvious that equation (4.1) involves the lighting
direction, but it also involves the shape $(p,q)$. We would like
to find the direction $\mathbf{l}$ without knowing the shape $(p,q)$;
otherwise the problem is of no importance. Theorem 2 is very
important, in the sense that it has established a constraint
between lighting direction, shape and displacement vectors.
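
As a sanity check on the lighting constraint, the following Python sketch builds the shape vectors a = (1, 0, -p) and b = (0, 1, -q), rotates them rigidly, reads off the four displacement differences from the rotated vectors, computes r from relation (4.8) with the positive sign, and evaluates the left-hand side of the constraint for a unit lighting vector. The helper names, the sampling ranges and the use of a Rodrigues rotation are choices made for this example only.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def rotate(v, axis, angle):
    """Rodrigues rotation of v about a unit axis through the origin."""
    c, s = math.cos(angle), math.sin(angle)
    kxv, kdv = cross(axis, v), dot(axis, v)
    return tuple(v[i] * c + kxv[i] * s + axis[i] * kdv * (1.0 - c) for i in range(3))


def lighting_residual(p, q, l, r, dxu, dyu, dxv, dyv):
    """Left-hand side of the lighting constraint for a unit vector l."""
    l1, l2, l3 = l
    alpha = l2 * dyu - l1 * (1.0 + dyv)
    beta = l1 * dxv - l2 * (1.0 + dxu)
    det = (1.0 + dxu) * (1.0 + dyv) - dyu * dxv  # the triple product [a', b', k]
    return (p * p * (alpha * alpha - r * r * l1 * l1)
            + 2.0 * p * q * (alpha * beta - r * r * l1 * l2)
            + q * q * (beta * beta - r * r * l2 * l2)
            - 2.0 * p * l1 * l3 * r * (r - det)
            - 2.0 * q * l2 * l3 * r * (r - det)
            - det * det
            + l1 * l1 * (dxv * dxv + (1.0 + dyv) ** 2)
            + l2 * l2 * (dyu * dyu + (1.0 + dxu) ** 2)
            - 2.0 * l1 * l2 * ((1.0 + dxu) * dxv + dyu * (1.0 + dyv))
            - r * r * l3 * l3 + 2.0 * r * l3 * l3 * det)


def worst_lighting_residual(trials=200, seed=1):
    """Largest |residual| over random gradients, lights and small rigid rotations."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        p, q = rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)
        l = unit((rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3), 1.0))
        axis = unit([rng.gauss(0, 1) for _ in range(3)])
        angle = rng.uniform(-0.2, 0.2)
        a, b = (1.0, 0.0, -p), (0.0, 1.0, -q)
        a2, b2 = rotate(a, axis, angle), rotate(b, axis, angle)
        r = dot(cross(a2, b2), l) / dot(cross(a, b), l)  # relation (4.8)
        res = lighting_residual(p, q, l, r, a2[0] - 1.0, b2[0], a2[1], b2[1] - 1.0)
        worst = max(worst, abs(res))
    return worst
```

For the identity motion (all differences zero and r = 1) the residual reduces to the norm condition on l and vanishes for every gradient (p, q), as one would expect of a constraint that carries no information when nothing moves.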

We  now  proceed  with  the  following  theorem. 

THEOREM 3: Suppose that the surface $S$ (locally planar) is
moving with a rigid motion, and the camera model is the one
described in the previous theorem. Let the gradient of the
surface (with respect to the first frame) be $(p(x,y), q(x,y))$ and
the displacement vector field be $(u(x,y), v(x,y))$. It is known
that this motion can be considered as a translation $(dx, dy,
dz)$ plus a rotation of an angle $\theta$ about an axis $(n_1, n_2, n_3)$
passing through the origin $(n_1^2 + n_2^2 + n_3^2 = 1)$. If the rotation
angle $\theta$ is small, then the following relations hold:

(a) The displacement vector field $(u(x,y), v(x,y))$ is given by

$u(x,y) = dx + B\,z(x,y) - C\,y$

$v(x,y) = dy + C\,x - A\,z(x,y),$

where $A = n_1\theta$, $B = n_2\theta$, $C = n_3\theta$ and $z(x,y)$ is the depth
of the surface point whose projection is the image point
$(x,y)$.

(b) The gradient and the ratio $A/B$ are given by

$p(x,y) = \dfrac{1}{B}\,\Delta_x u, \qquad q(x,y) = -\dfrac{1}{A}\,\Delta_y v,$

$\dfrac{A}{B} = \dfrac{-\Delta_x v - \Delta_y u + \sqrt{(\Delta_y u + \Delta_x v)^2 - 4\,\Delta_x u\,\Delta_y v}}{2\,\Delta_x u}$


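Proof:

(a) Trivial.

(b) Using the two equations in (a) and the assumption about
local planarity, we get

$\Delta_x u = B\,p$

$\Delta_x v = C - A\,p$

$\Delta_y u = B\,q - C$

$\Delta_y v = -A\,q$

where $p$ and $q$ are considered at the point $(x,y)$.
From the above equations we get

$p(x,y) = \dfrac{1}{B}\,\Delta_x u$

$q(x,y) = -\dfrac{1}{A}\,\Delta_y v$

Eliminating $p$, $q$ and $C$ from the four equations gives the
quadratic $\Delta_x u\,(A/B)^2 + (\Delta_x v + \Delta_y u)(A/B) + \Delta_y v = 0$, so that

$\dfrac{A}{B} = \dfrac{-\Delta_x v - \Delta_y u + \sqrt{(\Delta_y u + \Delta_x v)^2 - 4\,\Delta_x u\,\Delta_y v}}{2\,\Delta_x u}$

(q.e.d.)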
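
To make part (b) concrete, the following Python sketch solves the quadratic satisfied by the ratio A/B, given the four displacement differences at one image point. It is an illustration under the small-rotation relations of the proof; the function name is hypothetical, and the sketch returns both real roots rather than committing to one sign of the square root, since which root equals A/B depends on the signs of the quantities involved.

```python
import math


def rotation_ratio_candidates(dxu, dyu, dxv, dyv):
    """Real roots k of dxu*k**2 + (dxv + dyu)*k + dyv = 0; one of them is A/B."""
    s = dxv + dyu
    disc = s * s - 4.0 * dxu * dyv
    if dxu == 0.0 or disc < 0.0:
        return []  # degenerate point: the ratio cannot be read off here
    root = math.sqrt(disc)
    return [(-s + root) / (2.0 * dxu), (-s - root) / (2.0 * dxu)]
```

For example, with A = 0.02, B = 0.05, C = 0.01 and gradient (p, q) = (0.4, -0.7), the relations of the proof give differences 0.02, -0.045, 0.002 and 0.014 for the x and y differences of u and the x and y differences of v respectively; the two candidates are then 1.75 and 0.4, and the second one is A/B.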


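4.6. Development of the Lighting Direction Constraint

In this section we develop the constraint that will be
used as the heart of the algorithm that will solve the problem
of the determination of the illuminant direction. If we let
$1/A = \alpha$, $1/B = \beta$, $B/A = K$ and use part (b) of Theorem
3 to substitute in equation (4.1) for $p$ and $q$, we get the
following equation.

$$
\begin{aligned}
& (\Delta_x u)^2\beta^2\bigl[(l_2\Delta_y u - l_1(1+\Delta_y v))^2 - r^2 l_1^2\bigr] \\
&- 2\,\Delta_x u\,\Delta_y v\,K\beta^2\bigl[(l_1\Delta_x v - l_2(1+\Delta_x u))(l_2\Delta_y u - l_1(1+\Delta_y v)) - r^2 l_1 l_2\bigr] \\
&+ (\Delta_y v)^2 K^2\beta^2\bigl[(l_1\Delta_x v - l_2(1+\Delta_x u))^2 - r^2 l_2^2\bigr] \\
&- 2(\Delta_x u)\,\beta\,l_1 l_3\,r\bigl(r - (1 + \Delta_x u + \Delta_y v + \Delta_x u\,\Delta_y v - \Delta_y u\,\Delta_x v)\bigr) \\
&+ 2(\Delta_y v)\,K\beta\,l_2 l_3\,r\bigl(r - (1 + \Delta_x u + \Delta_y v + \Delta_x u\,\Delta_y v - \Delta_y u\,\Delta_x v)\bigr) \\
&- (1 + \Delta_x u + \Delta_y v + \Delta_x u\,\Delta_y v - \Delta_y u\,\Delta_x v)^2 \\
&+ l_1^2\bigl((\Delta_x v)^2 + (1+\Delta_y v)^2\bigr) + l_2^2\bigl((\Delta_y u)^2 + (1+\Delta_x u)^2\bigr) \\
&- 2 l_1 l_2\bigl((1+\Delta_x u)\Delta_x v + \Delta_y u(1+\Delta_y v)\bigr) \\
&- r^2 l_3^2 + 2 r\,l_3^2(1 + \Delta_x u + \Delta_y v + \Delta_x u\,\Delta_y v - \Delta_y u\,\Delta_x v) = 0 \qquad (4.9)
\end{aligned}
$$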


The above equation is a polynomial in $l_1, l_2, l_3, \beta$.

Considering equation (4.9) at four points we get a
polynomial system of four equations in four unknowns. A simple
but tedious calculation of the Jacobian of this system gives us
the fact that the Jacobian has rank four (except for the
degenerate cases, whose set has measure zero). This means
(inverse function theorem) that the function defined by the
equations of the system is locally an isomorphism, which
means that its zeros are isolated. But, from Whitney's
theorem, the set of the zeros of this algebraic system is an
algebraic set and it has finitely many components.

The conclusion of this is that the system has finitely many
solutions. If we now consider equation (4.9) at
five points, then we get a system of five equations in four
unknowns which, barring degeneracy, will have at most one
solution.


It is clear from the above discussion that two intensity
images of a Lambertian surface, with the correspondence
between them established, give the lighting direction
uniquely. In the next section, we present a practical way to
recover the lighting direction based on the constraint
developed in this section.


4.7.  The  Algorithm  for  Finding  Illuminant  Direction 

First of all we choose the Gaussian sphere formalism
(azimuth-elevation) to represent the vector that denotes the
lighting direction. More specifically, we set

$l_1 = \cos\phi\,\cos\theta$

$l_2 = \cos\phi\,\sin\theta$

$l_3 = \sin\phi$

where $\theta$ and $\phi$ are the azimuth and elevation (see Figure 4.4).
Now we consider equation (4.9) at $n$ points in the images, and
we get $n$ equations eq1, eq2, ..., eqn in the three
unknowns $\beta$, $\theta$, $\phi$. Then, the following algorithm solves the
problem:

for all θ
  for all φ
  {
    get n quadratic equations in β
    check if they have a common solution
    if yes, output θ, φ
  }


Figure  4.4:  Gaussian  sphere. 
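
The search over the Gaussian sphere can be sketched in Python as follows. This is an illustration of the control flow only: the callable `coefficients`, which is assumed to return one triple (a, b, c) of quadratic coefficients in β per image point for a given lighting vector, is hypothetical and stands in for the evaluation of equation (4.9); the grid resolution and the tolerance are likewise assumptions.

```python
import math


def real_roots(a, b, c, eps=1e-12):
    """Real roots of a*x**2 + b*x + c = 0 (falls back to the linear case)."""
    if abs(a) < eps:
        return [] if abs(b) < eps else [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []
    root = math.sqrt(disc)
    return [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]


def common_root(quadratics, tol=1e-6):
    """Return a value that is (approximately) a root of every quadratic, or None."""
    first, rest = quadratics[0], quadratics[1:]
    for x in real_roots(*first):
        if all(abs(a * x * x + b * x + c) <= tol for a, b, c in rest):
            return x
    return None


def search_light_direction(coefficients, theta_steps=36, phi_steps=9, tol=1e-6):
    """Scan azimuth and elevation; keep directions whose quadratics share a root."""
    hits = []
    for i in range(theta_steps):
        theta = 2.0 * math.pi * i / theta_steps
        for j in range(phi_steps + 1):
            phi = 0.5 * math.pi * j / phi_steps
            l = (math.cos(phi) * math.cos(theta),
                 math.cos(phi) * math.sin(theta),
                 math.sin(phi))
            beta = common_root(coefficients(l), tol)
            if beta is not None:
                hits.append((theta, phi, beta))
    return hits
```

With noisy data an exact common root will not exist, so in practice one would presumably relax the tolerance or rank the directions by how nearly the quadratics agree; that choice is outside the scope of this sketch.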


4.8.  Applying  the  Algorithm  to  Natural  Images 

When one is experimenting with natural images, it is
sometimes difficult to compute the displacement field for
every point in the image. In that case, one can compute the
parameters of the correspondence of small regions. In other
words, if we have a small planar region $S_1$ in the first image
that corresponds to a small region $S_2$ in the second image,
then the parameters of an affine transformation
$(x, y) \mapsto (ax + by + c,\ dx + ey + f)$ between the two patterns
(see Figure 4.5) can be computed using a variation of a
least-squares method introduced by Lucas and Kanade that is
described in Webb [1983]. In that case, the essential
constraint (equation (4.9)) has a similar form and the whole
analysis proceeds as previously.


Figure  4.5:  Corresponding  regions. 


4.9.  Implementation  and  Experiments 

We have implemented the above mentioned algorithm,
and it works successfully for synthetic images. Figure 4.6
represents the displacement vector field that was obtained
from the motion (rotational, with $\omega = (1, 3, 3)$) of a
sphere. Figure 4.7 represents the image of the sphere before
the motion. The surface of the sphere is supposed to be
Lambertian, the albedo $\rho = 1$ and the lighting direction with
gradient $(p_s, q_s) = (0.7, 0.3)$, i.e. to the right and a little above
the horizon. The computed lighting direction was (0.05, O.TJ).
The observed inaccuracy is due to discretization effects. In
our synthetic experiment, we did not compute the
displacements from the images; instead, we calculated them
analytically, since we knew the motion and the exact position
of the sphere. If the Lambertian reflectance model is not
adequate for capturing natural images (which it is not, of
course), there probably exists a model (not discovered yet)
that captures natural shading. This model should of course
depend on the shape and the lighting direction. The approach
that we took here can be taken with any other model of
reflectance, and it is one of our future goals to apply the
method to natural images and employ general reflectance
models that consider the illumination from the sun and the
sky.

Figure 4.6: Displacements field.

The next section discusses the problem of determining
shape from shading and motion in a unique way. Again, the
findings of the next section cannot be applied at this point to
natural images, for the reasons that we mentioned above.
The treatment again is of theoretical value, and the method
could be applied to natural imagery if better reflectance models
(i.e. models that capture reality) were known, and the
computation of retinal motion in natural images were feasible.

Figure 4.7: Intensity image for a sphere.

4.10.  Computing  Shape  from  Shading  and  Motion 

In this section we discuss the problem of determining
shape from shading and motion. Before we proceed we need
some technical prerequisites, which are introduced in the next
sections.

4.11. The Constraint Between Shape and Displacements

THEOREM 4: With the assumptions and notation of
Theorem 2, the gradient $(p,q)$ of a surface point whose image
is the point $(x,y)$, with displacement vector $(u,v)$, satisfies the
constraint

$A p^2 + B q^2 - 2C pq + D = 0,$ with

$A = (\Delta_y u)^2 + (\Delta_y v)^2 + 2\Delta_y v$

$B = (\Delta_x u)^2 + (\Delta_x v)^2 + 2\Delta_x u$

$C = \Delta_y u + \Delta_x u\,\Delta_y u + \Delta_x v + \Delta_x v\,\Delta_y v$

$D = C^2 - AB,$

where the differences $\Delta_x u$, $\Delta_y u$, $\Delta_x v$, $\Delta_y v$ are evaluated at
$(x,y)$ and are defined in Theorem 2.

Proof: From the proof of Theorem 2, because of the rigid
motion assumption, we have the preservation of the dot
product. So,

$\mathbf{a} \cdot \mathbf{b} = \mathbf{a}' \cdot \mathbf{b}'$

that is, $\lambda\mu = pq - C$. Squaring and substituting
$\lambda^2 = p^2 - B$ and $\mu^2 = q^2 - A$ gives

$A p^2 + B q^2 - 2C pq + D = 0$   (4.10)

where $(p,q)$ is the gradient at the point $(x,y)$ and $A$, $B$, $C$, $D$
are the coefficients given in the statement of the theorem.

Equation (4.10) gives the constraint between displacements
and shape and represents a conic section in $p$-$q$ space.
This conic section is a hyperbola or parabola depending on
the values of the coefficients $A$, $B$, $C$. Constraint
(4.1), by contrast, is a constraint between shape, displacements
and the lighting direction. For the purposes of the rest of this
section, and to avoid confusion, we will refer to constraint
(4.10) as the shape-motion constraint.
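
The shape-motion constraint can also be checked numerically. The Python sketch below, offered only as an illustration, rotates the shape vectors a = (1, 0, -p) and b = (0, 1, -q) rigidly, reads the displacement differences off the rotated vectors and evaluates the left-hand side of the conic. The helper names and sampling ranges are assumptions of this example.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(v, axis, angle):
    """Rodrigues rotation of v about a unit axis through the origin."""
    c, s = math.cos(angle), math.sin(angle)
    kxv, kdv = cross(axis, v), dot(axis, v)
    return tuple(v[i] * c + kxv[i] * s + axis[i] * kdv * (1.0 - c) for i in range(3))


def shape_motion_residual(p, q, dxu, dyu, dxv, dyv):
    """Left-hand side of A p^2 + B q^2 - 2 C p q + D with D = C^2 - A B."""
    A = dyu * dyu + dyv * dyv + 2.0 * dyv
    B = dxu * dxu + dxv * dxv + 2.0 * dxu
    C = dyu + dxu * dyu + dxv + dxv * dyv
    D = C * C - A * B
    return A * p * p + B * q * q - 2.0 * C * p * q + D


def worst_shape_motion_residual(trials=200, seed=2):
    """Largest |residual| over random gradients and rigid rotations."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        p, q = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        raw = [rng.gauss(0, 1) for _ in range(3)]
        norm = math.sqrt(dot(raw, raw))
        axis = tuple(c / norm for c in raw)
        angle = rng.uniform(-0.3, 0.3)
        a2 = rotate((1.0, 0.0, -p), axis, angle)
        b2 = rotate((0.0, 1.0, -q), axis, angle)
        res = shape_motion_residual(p, q, a2[0] - 1.0, b2[0], a2[1], b2[1] - 1.0)
        worst = max(worst, abs(res))
    return worst
```

Note that the constraint involves no intensities at all, which is why it remains usable when the albedo is unknown.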

4.12.  How  to  Utilize  the  Constraints 

Here we show how to utilize the constraints to recover
the three-dimensional shape of the object in view, using
shading and motion. It is assumed that the lighting direction has
already been computed with the algorithm described in
Section 4.7. Up to now we have developed two constraints on
shape, which also involve retinal motion displacements and the
lighting direction. The lighting constraint is a conic in $p, q$
with coefficients that depend on (relative) intensities,
displacements and lighting direction. The shape-motion
constraint is again a conic in $p$ and $q$, with coefficients that
depend on the displacement vectors. Finally, the image
irradiance equation

$I = \rho\,\lambda\,(\mathbf{n} \cdot \mathbf{l})$   (4.11)

which determines the intensity $I(x,y)$ at a point $(x,y)$ of the
image as a function of the shape $\mathbf{n}$ of the world point whose
image is the point $(x,y)$ and the lighting direction $\mathbf{l}$, is another
constraint on $p$ and $q$ that is also a conic. This constraint we
will call the image-irradiance constraint. The next
subsections will describe algorithms for the unique computation of
shape, under a variety of situations.

Figure  4.7.1  below  gives  a  geometrical  description  of  the 
constraints. 

4.12.1. Computing shape when the albedo is known
(numerical techniques)

When the albedo is known, we have at our disposal
three constraints at every image point for the computation of
shape: the lighting constraint, let it be $F_1(p,q) = 0$, the
shape-motion constraint, let it be $F_2(p,q) = 0$, and the
image-irradiance constraint, let it be $F_3(p,q) = 0$. These are
three equations, all of degree two, and their system, barring
degeneracy, will have at most one solution. Several algebraic
or geometrical techniques exist for solving a system of
algebraic equations, each of degree two. Direct methods result in
solving equations of a high degree, and so we prefer to use an
iterative technique; even though we do not have theoretical
results about the convergence of the technique, in practice
it has been shown to converge very fast to the right solution.

The function $E(p,q) = \lambda_1 (F_1(p,q))^2 + \lambda_2 (F_2(p,q))^2 +
\lambda_3 (F_3(p,q))^2$ should be minimized everywhere in the image,
where $\lambda_1, \lambda_2, \lambda_3$ are constants with their sum equal to one. If
we add one more term to the error function that accounts for
smoothness, set the partial derivatives of $E(p,q)$
equal to zero and solve for $p$ and $q$, we get equations of the
following form: $p = G_1(p, q, p_{av})$, $q = G_2(p, q, q_{av})$, where $G_1$
and $G_2$ are polynomials in $p, q, p_{av}$ and $p, q, q_{av}$ respectively.
These equations can be solved iteratively, provided that we
have an approximate initial solution. If the values of $p$ and $q$
at the boundaries are known, then $p, q$ are propagated
throughout the image using a general smoothness criterion, by
an algorithm similar to the Gauss-Seidel algorithm of Ikeuchi
and Horn [Ikeuchi and Horn, 1981]. At this point we should
emphasize that we do not depend on smoothness to achieve
uniqueness in our methods. Smoothness is used to achieve an
approximate initial solution.

Figure 4.7.1: Pictorial description of the constraints.
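
The local part of this minimization can be sketched as follows. The Python code below is an illustration and not the Gauss-Seidel scheme referred to above: it minimizes the weighted sum of squared conic residuals at a single image point with a Gauss-Newton iteration, omits the smoothness term, and assumes that an approximate initial gradient is available. Each conic is passed as a hypothetical coefficient tuple (A, B, C, D, E, F) of A p^2 + B q^2 + C pq + D p + E q + F.

```python
def conic(coeffs, p, q):
    """Value of A p^2 + B q^2 + C p q + D p + E q + F at (p, q)."""
    A, B, C, D, E, F = coeffs
    return A * p * p + B * q * q + C * p * q + D * p + E * q + F


def conic_gradient(coeffs, p, q):
    """Partial derivatives of the conic with respect to p and q."""
    A, B, C, D, E, _ = coeffs
    return (2.0 * A * p + C * q + D, 2.0 * B * q + C * p + E)


def refine_gradient(conics, weights, p, q, iterations=20):
    """Gauss-Newton minimization of sum_i w_i * F_i(p, q)^2 from a starting guess."""
    for _ in range(iterations):
        # Accumulate the 2x2 normal equations (J^T W J) d = -(J^T W F).
        h11 = h12 = h22 = g1 = g2 = 0.0
        for coeffs, w in zip(conics, weights):
            f = conic(coeffs, p, q)
            fp, fq = conic_gradient(coeffs, p, q)
            h11 += w * fp * fp
            h12 += w * fp * fq
            h22 += w * fq * fq
            g1 += w * fp * f
            g2 += w * fq * f
        det = h11 * h22 - h12 * h12
        if abs(det) < 1e-15:
            break  # degenerate configuration: stop rather than divide by ~0
        p += (-g1 * h22 + g2 * h12) / det
        q += (-g2 * h11 + g1 * h12) / det
    return p, q
```

As a usage example, three conics that all pass through (0.5, -0.3), namely a circle, a hyperbola and a parabola, with equal weights and the starting guess (0.6, -0.2), lead the iteration back to their common point.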

4.12.2.  Computing  shape  when  the  albedo  is  not 
known 

a)  Iteratively. 

If the albedo is not known, then we cannot utilize the
image-irradiance constraint, because it contains the albedo as
a coefficient. We have to use the lighting constraint and the
shape-motion constraint. An algorithm similar to the one in
the above section can be easily obtained. At this point, the
uniqueness of this problem has to be discussed. The lighting
constraint and the shape-motion constraint are two conics in
$p$ and $q$. The Jacobian of the system that they form is
nonzero. So, the function defined by the system is locally an
isomorphism (from the inverse function theorem), which means
that its zeros are isolated. But from Whitney's theorem, the
set of zeros of this algebraic system is an algebraic set and it
has finitely many components. The conclusion of this is that
the system has finitely many solutions. In this case, the
solutions might be restricted to a unique solution, if a local
smoothness constraint is used. This, being in the paradigm of
regularization theory, has been observed from experiments,
and up to this point we do not have a formal proof.

b)  Directly 

In a minimization scheme based on the Lagrange
multiplier technique, the solution is obtained without propagating
the boundary conditions. If $F_1(p,q) = 0$ is the lighting
constraint and $F_2(p,q) = 0$ the shape-motion constraint, the error
function $E(p,q) = (F_1(p,q))^2$ is to be minimized subject to
the constraint $F_2(p,q) = 0$. The Lagrange multiplier scheme
says that the $p, q$ that minimize $E$ are one of the solutions of
the following system:

$\nabla E = \lambda \nabla F_2, \quad F_2 = 0,$ where $\lambda$ is the Lagrange multiplier.

4.12.3.  Implementation  and  experiments 

Figures 4.8 and 4.9 represent exactly the same entities as
Figures 4.6 and 4.7. From this input, our algorithm (Section 4.12.1)
recovered the shape shown in Figure 4.10. A local smoothing
scheme has been used at the end of the program to smooth
out the results. The error in the resulting shape is very low,
basically due to discretization effects. If we introduce noise in
the input, then the results get very much corrupted if we
do not apply a smoothing scheme, because of the locality of the
method. If a local smoothness constraint is utilized, then the
results are very good.

4.13.  Conclusions  and  Future  Directions 

In this section we have presented a theory on how to
compute in a unique way shape and the direction of the light
source, from shading and motion, under orthographic
projection. Our input is the intensity of two images in a dynamic
sequence, with the correspondence between the two frames
established. We proved that in this case the light source
direction and the shape of the object in view can be uniquely
determined. Our results have theoretical value, since they
demonstrate that if we combine information from different
sources then we can obtain unique results for intrinsic images.
In the past, there has been in this framework only the work
of Grimson [Grimson, 1984], which combined shading with
stereo with reasonable results. It is one of our future goals to
extend this theory to capture a very general reflectance map
[Brooks and Horn, 1985] that models the illumination due to
the sun and the sky.

Figure 4.8: Intensity image for a sphere.

Figure 4.9: Displacements field.

Figure 4.10: Reconstructed shape.

5.  SHAPE  FROM  TEXTURE  AND  MOTION 

In  this  section  we  study  the  problem  of  the  perception  of 
shape  from  texture  and  motion.  We  prove  that  a  monocular 
observer  that  moves  with  a  known  motion  can  uniquely 
recover  the  gradient  of  the  plane  in  view.  In  particular: 

(1) Our theory does not require the finding of any correspondence.

(2) There are no assumptions about the texture.

(3) The computation of spatial derivatives of the image intensity function (an ill-posed problem) is not needed.

The problem of shape from texture has received a lot of attention in the past few years and some excellent research on the topic has been published [Stevens, 1979; Witkin, 1980; Davis et al., 1983; Kanatani, 1981]. The problem is defined as “finding the orientation of a textured surface from a static monocular view of it." This problem is ill-posed in the sense that there exist infinitely many solutions. To restrict the space of solutions, assumptions have to be made about the texture. Assumptions such as directional isotropy and uniform density have been employed in previous research. Uniform density has been defined as density of texels or density of the sum of the lengths of the contours (zero-crossings) in the image.

It is very clear that even though some of the assumptions used in the literature for the recovery of shape from texture are general enough, they are not powerful enough to capture a very large subset of natural images. As a result, the developed algorithms fail when they are applied to many real surfaces. Furthermore, there is no way to check in advance whether or not a particular assumption is valid for the surface that is imaged. This problem alone is enough to demonstrate the restricted applicability of the existing shape from texture algorithms (or of the ones yet to come). We will show that if texture is combined with motion then the shape from texture problem, or the problem of shape detection from surface intensity and markings, becomes easy, in the sense that no restrictive assumptions are necessary and the solution is obtained from linear equations.

The  next  section  introduces  the  mathematical  prere¬ 
quisites.  For  simplicity,  and  without  loss  of  generality,  we 
will  assume  that  the  surface  in  view  is  planar  (as  in  previous 
shape from texture research); see Figure 5.1. If the surface in
view is nonplanar, then the problem can be addressed either
by applying our theory locally in the image, i.e. assuming that
the  surface  in  view  is  locally  planar,  or  if  a  parametric  model 
for  the  surface  is  assumed,  then  the  same  basic  principles 
reported  here  may  be  used  to  recover  the  parameters  of  the 
surface  (with  the  difference  that  the  resulting  equations  might 
not  be  linear).  We  want  a  theory  for  shape  from  texture  that 
would work for all kinds of images, such as the ones in Figures 5.2-5.4.

Figure 5.1: Motion of a textured planar surface.

Figure 5.2: Leaves.

Figure 5.3: Coins.


5.1.  Prerequisites 

The treatment and symbolism here follow those in [Ito and Aloimonos, 1987]. Suppose that the camera is looking at a planar surface. Assume further that the camera is moving. For our analysis we assume that the surface is moving. This is equivalent to the motion of the camera, and it is done here to simplify the formulas. Call the planar surface in the world $\Pi$ and the image plane $\Pi'$. Suppose that a point $\mathbf{X} = (X, Y, Z) \in \Pi$ is projected onto the point $\mathbf{x} = (x, y) \in \Pi'$. Let the motion of the surface consist of a translation $T = (r_1, r_2, r_3)$ and a rotation $\Omega = (r_4, r_5, r_6)$, or

$$V(\mathbf{X}) = T + \Omega \times \mathbf{X},$$

where $V(\mathbf{X})$ is the velocity of a point $\mathbf{X} \in \Pi$. Then this velocity can be written as

$$V(\mathbf{X}) = \sum_{k=1}^{6} r_k V_k(\mathbf{X}),$$

where

$$V_1(\mathbf{X}) = (1\ 0\ 0)^T, \quad V_2(\mathbf{X}) = (0\ 1\ 0)^T, \quad V_3(\mathbf{X}) = (0\ 0\ 1)^T,$$

$$V_4(\mathbf{X}) = (0\ \ {-Z}\ \ Y)^T, \quad V_5(\mathbf{X}) = (Z\ \ 0\ \ {-X})^T, \quad V_6(\mathbf{X}) = ({-Y}\ \ X\ \ 0)^T.$$

Then, it can be easily proved that the optic flow (image velocity) at a point $\mathbf{x} = (x, y)$ is

$$\dot{\mathbf{x}} = \sum_{k=1}^{6} r_k \mathbf{u}_k(\mathbf{x}),$$

where $\mathbf{u}_k$ is a function depending on the shape.

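We prove the above equation avoiding the details of the perspective projection. Let the projection from the world to the image plane be $P$, with $P(\mathbf{X}) = \mathbf{x} = (x, y) = (P_1(\mathbf{X}), P_2(\mathbf{X}))$. If the shape of the surface is given (a function $h$), a mapping $P_h^{-1}$ from the image plane to the object surface is defined such that

$$P(P_h^{-1}(\mathbf{x})) = \mathbf{x}.$$

The optic flow $\dot{\mathbf{x}}$ at a point $\mathbf{x} = (x, y)$ is then given by

$$(\dot{x}, \dot{y})^T = \frac{d}{dt} P(\mathbf{X}) = \frac{\partial P}{\partial \mathbf{X}} V(\mathbf{X}), \qquad
\frac{\partial P}{\partial \mathbf{X}} =
\begin{bmatrix}
\dfrac{\partial P_1}{\partial X} & \dfrac{\partial P_1}{\partial Y} & \dfrac{\partial P_1}{\partial Z} \\[2mm]
\dfrac{\partial P_2}{\partial X} & \dfrac{\partial P_2}{\partial Y} & \dfrac{\partial P_2}{\partial Z}
\end{bmatrix}.$$

Furthermore, $\partial P / \partial \mathbf{X}$ and $V$ are functions of $\mathbf{X}$, but using the inverse projection (from the image to the surface), we can consider them as functions of $\mathbf{x}$ (retinal coordinates). Thus,

$$\frac{\partial P}{\partial \mathbf{X}} = \frac{\partial P}{\partial \mathbf{X}}(P_h^{-1}(\mathbf{x})) \quad \text{and} \quad V = V(P_h^{-1}(\mathbf{x})).$$

So, since $V = \sum_{k=1}^{6} r_k V_k$, we have

$$\dot{\mathbf{x}} = \sum_{k=1}^{6} r_k \mathbf{u}_k(\mathbf{x}) \quad \text{with} \quad \mathbf{u}_k(\mathbf{x}) = \frac{\partial P}{\partial \mathbf{X}}(P_h^{-1}(\mathbf{x}))\, V_k(P_h^{-1}(\mathbf{x})). \qquad (5.1)$$

Equation (5.1) will be used very frequently in the sequel.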
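As a quick numerical check of the velocity decomposition used above, the following sketch compares $T + \Omega \times \mathbf{X}$ with $\sum_k r_k V_k(\mathbf{X})$ at one sample point. The function names and the sample values are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def velocity_basis(X):
    """Return the six basis velocities V_1..V_6 at the 3-D point X, one per row."""
    x, y, z = X
    return np.array([
        [1.0, 0.0, 0.0],   # V_1: unit translation along X
        [0.0, 1.0, 0.0],   # V_2: unit translation along Y
        [0.0, 0.0, 1.0],   # V_3: unit translation along Z
        [0.0, -z, y],      # V_4: unit rotation about the X axis
        [z, 0.0, -x],      # V_5: unit rotation about the Y axis
        [-y, x, 0.0],      # V_6: unit rotation about the Z axis
    ])

def rigid_velocity(X, T, omega):
    """Velocity T + omega x X of a point X under rigid motion."""
    return np.asarray(T, dtype=float) + np.cross(omega, X)

# Hypothetical sample point and motion parameters.
X = np.array([0.3, -1.2, 4.0])
T = np.array([0.1, 0.0, -0.2])
omega = np.array([0.02, -0.05, 0.01])

r = np.concatenate([T, omega])      # r = (r_1, ..., r_6)
v_sum = r @ velocity_basis(X)       # sum_k r_k V_k(X)
```

The two expressions agree to rounding error, which is all the decomposition claims: the motion parameters enter the velocity linearly.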

5.2.  Linear  Features 

Here we introduce the concept of a linear feature vector, which has proved to be a strong device for several problems in cybernetics [Amari, 1986]. Let the image intensity function be denoted by $s(x,y)$. A linear feature (LF) is a linear functional $f$ over the image, i.e.

$$f = \iint s(x,y)\, m(x,y)\, dx\, dy,$$

where $m$ is called a measuring function, $s$ is the image brightness and the integration is taken over the area of interest. A linear feature vector $\mathbf{f}$ (LFV) is a vector of linear features, i.e.

$$\mathbf{f} = [f_1\ f_2\ \cdots\ f_n]^T \quad \text{with} \quad f_i = \iint s(x,y)\, m_i(x,y)\, dx\, dy,$$

where $m_i$ is a measuring function, for $i = 1, \ldots, n$. $\{m_i\}$ could be any set; one good example is $\{m_i\} = \{m_{pq}\} = \{e^{i(px + qy)}\}$, in which case a linear feature corresponds to a Fourier component of the image.

Figure 5.4:
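A minimal sketch of how such a linear feature vector might be computed for a sampled image is given below. It assumes a square area of interest $[-w, w]^2$ sampled on a regular grid and approximates each integral by a Riemann sum; the function name `linear_feature_vector` and the parameter `half_width` are hypothetical choices made for this illustration.

```python
import numpy as np

def linear_feature_vector(image, freqs, half_width=1.0):
    """Approximate f_pq = integral of s(x, y) * exp(i (p x + q y)) by a Riemann sum.

    image      : 2-D array of brightness samples s(x, y)
    freqs      : iterable of (p, q) pairs, one per measuring function
    half_width : the area of interest is assumed to be [-half_width, half_width]^2
    """
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    x = np.linspace(-half_width, half_width, cols)
    y = np.linspace(-half_width, half_width, rows)
    xx, yy = np.meshgrid(x, y)                  # both of shape (rows, cols)
    cell = (x[1] - x[0]) * (y[1] - y[0])        # area element dx * dy
    return np.array([
        np.sum(image * np.exp(1j * (p * xx + q * yy))) * cell
        for p, q in freqs
    ])

# Two Fourier-type features of a constant image.
f = linear_feature_vector(np.ones((5, 5)), [(0.0, 0.0), (1.0, 2.0)])
```

Because each entry is a weighted sum of brightness values, the vector is linear in the image, and no feature points have to be located or matched in order to evaluate it.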

It  is  clear  that  a  linear  feature  is  a  global  image  feature, 
i.e.  a  global  characteristic  of  the  image.  Using  different 
measuring  functions,  we  get  different  global  measurements. 
The  hope  is  that  we  might  be  able  to  connect  properties  of 
the  3-D  surface  in  view  through  the  concept  of  the  linear 
feature  and  different  positions  of  the  camera.  In  that  case  we 
avoid  any  potential  correspondences.  The  following  section 
introduces  the  constraint  that  will  be  used  later. 

5.3.  The  Constraint 

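Since there is motion, the induced optical flow satisfies the following equation [Horn, 1986]:

$$s_x u + s_y v + s_t = 0,$$

where $(u, v)$ is the optic flow at a point $(x, y)$ and $s_x$, $s_y$, $s_t$ are the spatiotemporal derivatives of the image intensity function at the point $(x, y)$. This equation can be written as

$$\frac{\partial s}{\partial t} = -\dot{\mathbf{x}} \cdot \nabla s.$$

The time derivative of an LFV will be $\dot{\mathbf{f}} = [\dot{f}_1\ \cdots\ \dot{f}_n]^T$, where

$$\dot{f}_i = \iint \frac{\partial s}{\partial t}\, m_i\, dx\, dy = -\iint m_i\, (\dot{\mathbf{x}} \cdot \nabla s)\, dx\, dy.$$

The optic flow field (from equation 5.1) can be written in the form

$$\dot{\mathbf{x}} = \sum_{k=1}^{6} r_k \mathbf{u}_k = \sum_{k=1}^{6} r_k \begin{bmatrix} u_k(\mathbf{x}) \\ v_k(\mathbf{x}) \end{bmatrix}.$$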


Up to this point we have connected the optic flow field with the parameters of the motion of our camera and the shape of the surface in view (functions $\mathbf{u}_k$). In the sequel, we will examine the temporal derivative of a global feature, i.e. how it changes with time.

We have

$$\dot{f}_i = -\iint m_i\, (\dot{\mathbf{x}} \cdot \nabla s)\, dx\, dy$$

or

$$\dot{f}_i = -\sum_{k=1}^{6} r_k \iint m_i\, (u_k s_x + v_k s_y)\, dx\, dy$$

or

$$\dot{f}_i = \sum_{k=1}^{6} h_{ik}\, r_k \quad \text{with} \quad h_{ik} = -\iint m_i\, (u_k s_x + v_k s_y)\, dx\, dy.$$

So, we have found that

$$\dot{\mathbf{f}} = H \mathbf{r}, \qquad (5.2)$$

where $H = (h_{ik})$ and $\mathbf{r} = (r_1\ r_2\ \cdots\ r_6)^T$, the motion parameters.

We have been rather abstract up to now, since we wished to emphasize that the theory does not depend on the projection used, as well as to simplify the formulae. With perspective projection (Fig. 5.1) the parameters in equation 5.1 are

$$r_1 = \frac{T_1}{c}, \quad r_2 = \frac{T_2}{c}, \quad r_3 = \frac{T_3}{c}, \quad r_4 = \omega_1, \quad r_5 = \omega_2, \quad r_6 = \omega_3,$$

$$\mathbf{u}_1(\mathbf{x}) = (1 - px - qy,\ 0), \quad \mathbf{u}_2(\mathbf{x}) = (0,\ 1 - px - qy), \quad \mathbf{u}_3(\mathbf{x}) = (-x(1 - px - qy),\ -y(1 - px - qy)),$$

$$\mathbf{u}_4(\mathbf{x}) = (xy,\ y^2 + 1), \quad \mathbf{u}_5(\mathbf{x}) = (-(x^2 + 1),\ -xy), \quad \mathbf{u}_6(\mathbf{x}) = (y,\ -x),$$

where the motion is translation $T = (T_1, T_2, T_3)$ and rotation $\Omega = (\omega_1, \omega_2, \omega_3)$, and the plane in view has equation $Z = pX + qY + c$ with respect to the camera coordinate system (Fig. 5.1).

Matrix  H  contains  the  parameters  of  the  plane  in  a 
linear  form.  So,  equation  (5.2)  relates  linear  features  with 
shape  and  motion  parameters.  Furthermore,  it  is  linear  in  the 
shape  of  the  planar  surface  in  view.  So,  a  simple  linear  least- 
squares  method  or  a  Hough  transform  technique  is  sufficient 
for  the  recovery  of  the  gradient  of  the  plane  in  view.  Depth 
can  be  computed  too,  if  desired.  Here  we  must  emphasize  the 
fact  that  no  local  correspondence  has  been  used.  The  only 
computed  quantity  was  the  time  derivative  of  a  linear  feature 
vector,  that  involves  the  whole  image. 
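One way to make this linearity explicit, assuming the perspective flow fields $\mathbf{u}_1, \ldots, \mathbf{u}_6$ quoted above, is to collect the translational terms of equation (5.2). The symbols $g_i$, $\rho_i$, $\alpha$, $\beta$ and $\gamma$ are introduced here only for illustration and do not appear in the original formulation.

$$
\begin{aligned}
g_i(x,y) &= -m_i \left[ (T_1 - x T_3)\, s_x + (T_2 - y T_3)\, s_y \right], \qquad
\rho_i = \sum_{k=4}^{6} r_k h_{ik}, \\
\dot{f}_i - \rho_i &= \alpha \iint g_i\, dx\, dy \;-\; \beta \iint x\, g_i\, dx\, dy \;-\; \gamma \iint y\, g_i\, dx\, dy,
\qquad \alpha = \frac{1}{c},\quad \beta = \frac{p}{c},\quad \gamma = \frac{q}{c}.
\end{aligned}
$$

The rotational term $\rho_i$ does not involve the plane parameters, since $\mathbf{u}_4$, $\mathbf{u}_5$ and $\mathbf{u}_6$ do not depend on $p$, $q$ or $c$. Each measuring function therefore contributes one equation that is linear in $(\alpha, \beta, \gamma)$, and with three or more features and a nonzero translation one can solve for them and set $p = \beta/\alpha$, $q = \gamma/\alpha$, $c = 1/\alpha$.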

Finally,  we  want  to  stress  here  the  fact  that  in  this  algo¬ 
rithm,  the  spatial  derivatives  of  the  intensity  function  don’t 
need  to  be  computed.  This  is  due  to  the  linear  feature  vector 
approach.  (Integration  by  parts  avoids  differentiation  of  the 
intensity  function.  Instead,  the  derivative  of  the  measuring 
function  has  to  be  computed.  So,  we  avoid  differentiating  the 
image  intensity,  which  is  discrete,  because  numerical 
differentiation  is  an  ill-posed  problem.)  More  importantly,  the 
same approach can be followed if the image is a dot pattern (or a line pattern such as zero crossings), i.e. if it is discontinuous. The reason for this is again the fact that the spatial derivatives of the intensity function don’t have to be estimated. Only the temporal derivative of the image needs to be estimated. This approach was initiated in [Ito and Aloimonos, 1987]. To further clarify, suppose that the image consists of a set of points (dot pattern) $(x_i, y_i)$, $i = 1, \ldots, n$. Then the moving image can be represented as

$$s(x, y, t) = \sum_{i=1}^{n} \delta(x - x_i(t),\ y - y_i(t)).$$

In that case a linear feature $f_k$ is trivially computable. So, this theory works for both differentiable and nondifferentiable image functions. To summarize the algorithm for the detection of shape from texture from two images $s(x,y,t)$ and $s(x,y,t+dt)$ taken by a camera with known motion, the following are the successive steps of our formulation.

1. The input is a sequence of images $s(x,y,t)$ and $s(x,y,t+dt)$, taken with known motion.

2. Choose a set of measuring functions $m_i(x,y)$, $i = 1, \ldots, n$. These are smooth differentiable functions such as $x^i y^j$, $0 \le i,j \le k$, where $k$ is a constant, or Fourier features such as $\cos(px + qy)$, where the coefficients $p, q$ are determined in a random manner.

3. Create a linear feature vector $\mathbf{f} = [f_1\ f_2\ \cdots\ f_n]^T$, where

$$f_i = \iint s(x,y)\, m_i(x,y)\, dx\, dy,$$

with the integration taken over the area of interest.

4. Compute the change of the linear feature vector $\mathbf{f}$, i.e. the time derivative $\dot{\mathbf{f}}$, from the images $s(x,y,t)$ and $s(x,y,t+dt)$.

5. Compute expressions for the entries of matrix $H$. For example:

$$h_{i3} = -\iint m_i\, (u_3 s_x + v_3 s_y)\, dx\, dy$$

or

$$h_{i3} = -\iint m_i \left( x(px + qy - 1)\, s_x + y(px + qy - 1)\, s_y \right) dx\, dy.$$

6. From the equation $\dot{\mathbf{f}} = H \mathbf{r}$, solve for $p$, $q$ (and/or $c$) with a least squares method.
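The steps above can be sketched for the dot-pattern case, where each linear feature reduces to a sum $f_i = \sum_j m_i(x_j, y_j)$ over the dots. The sketch below is an illustration under stated assumptions rather than our implementation: the measuring functions are $\cos(P_i x + Q_i y)$ with random coefficients, the flow uses the perspective fields $\mathbf{u}_1, \ldots, \mathbf{u}_6$ exactly as quoted in Section 5.3, and the time derivative $\dot{\mathbf{f}}$ is generated analytically from synthetic dot velocities, standing in for the frame difference $(\mathbf{f}(t+dt) - \mathbf{f}(t))/dt$ that would be measured in practice. All function names and numerical values are hypothetical.

```python
import numpy as np

def paper_flow(x, y, p, q, c, T, W):
    """Image velocity built from the perspective flow fields u_1..u_6 of the text."""
    s = (1.0 - p * x - q * y) / c
    u = s * (T[0] - x * T[2]) + W[0] * x * y - W[1] * (x**2 + 1) + W[2] * y
    v = s * (T[1] - y * T[2]) + W[0] * (y**2 + 1) - W[1] * x * y - W[2] * x
    return u, v

def feature_gradients(x, y, P, Q):
    """Gradients of m_i = cos(P_i x + Q_i y) at every dot; shape (n_features, n_dots)."""
    phase = np.outer(P, x) + np.outer(Q, y)
    return -P[:, None] * np.sin(phase), -Q[:, None] * np.sin(phase)

def feature_rates(x, y, u, v, P, Q):
    """Time derivative of f_i = sum_j m_i(x_j, y_j) when dot j moves with (u_j, v_j)."""
    gx, gy = feature_gradients(x, y, P, Q)
    return (gx * u + gy * v).sum(axis=1)

def recover_plane(x, y, fdot, P, Q, T, W):
    """Least-squares estimate of (p, q, c) from feature rates and the known motion."""
    gx, gy = feature_gradients(x, y, P, Q)
    rot_u = W[0] * x * y - W[1] * (x**2 + 1) + W[2] * y
    rot_v = W[0] * (y**2 + 1) - W[1] * x * y - W[2] * x
    rhs = fdot - (gx * rot_u + gy * rot_v).sum(axis=1)    # subtract the known rotational part
    g = gx * (T[0] - x * T[2]) + gy * (T[1] - y * T[2])   # translational weight per feature and dot
    A = np.stack([g.sum(axis=1), -(g * x).sum(axis=1), -(g * y).sum(axis=1)], axis=1)
    (alpha, beta, gamma), *_ = np.linalg.lstsq(A, rhs, rcond=None)  # unknowns 1/c, p/c, q/c
    return beta / alpha, gamma / alpha, 1.0 / alpha

# Synthetic experiment: 200 dots, 12 random Fourier-type features.
rng = np.random.default_rng(0)
x, y = rng.uniform(-0.5, 0.5, size=(2, 200))
P, Q = rng.uniform(-6.0, 6.0, size=(2, 12))
T, W = (0.3, -0.1, 0.2), (0.02, -0.01, 0.03)
u, v = paper_flow(x, y, 0.4, -0.2, 3.0, T, W)
fdot = feature_rates(x, y, u, v, P, Q)
p_hat, q_hat, c_hat = recover_plane(x, y, fdot, P, Q, T, W)
```

Note that `recover_plane` sees only the dot positions in one frame, the feature rates and the known motion; the per-dot velocities are used solely to synthesize the data. The system is solved in the variables $1/c$, $p/c$ and $q/c$, in which the equations are linear, and the translation must be nonzero for the plane parameters to be observable.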

5.4.  Experiments 

We have carried out experiments with synthetic and natural images. The pictures were processed by a VICOM processor and the motion was controlled using a Merlin American Robot arm. We will only present experiments with real images. Figure 5.5 shows the image of a drawing that was made with a marker on a white board. Figure 5.6 shows the image of the same scene after the motion of the camera. The orientation of the drawing with respect to the camera coordinate system was (ground truth) p = 0.4 and q = 0.0. It must be emphasized that estimating the ground truth must allow for an error of about 3-4 degrees in slant and tilt. The
recovered parameters in this case were p = 0.59 and q = -0.08. Figures 5.7 and 5.8 show the images of a planar dragon (poster), before and after the motion, respectively. Again, the ground truth was p = q = 0 and the recovered shape was p = 0.001 and q = 0.0003.

Figure 5.11: Cloth (rotated camera), before motion.

Figure 5.12: Cloth (rotated camera), after motion.

Finally, the next experiment was designed in order to avoid the ambiguities that arise in the ground truth measurements. In this experiment we measure the change in shape, not the actual shape. Figures 5.9 and 5.10 show the images of a piece of cloth, before and after a small motion. Figures 5.11 and 5.12 show again the images of the cloth, after the camera has considerably rotated in order to change the orientation of the cloth. We measure the two surface normals and then we compute the angle between the two normals, whose ground truth is known from the rotation of the camera. In this experiment the ground truth for the angle between the surface normals was θ = 15° and the estimated one was θ = 15.16°.
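The angle comparison in this experiment can be sketched as follows, assuming each recovered orientation is given as a gradient $(p, q)$ of a plane $Z = pX + qY + c$, so that a normal is $(p, q, -1)$. The function name and the sample gradients are illustrative and are not the measured values.

```python
import numpy as np

def angle_between_planes(p1, q1, p2, q2):
    """Angle in degrees between the normals (p, q, -1) of two planes Z = pX + qY + c."""
    n1 = np.array([p1, q1, -1.0])
    n2 = np.array([p2, q2, -1.0])
    cos_theta = n1 @ n2 / (np.linalg.norm(n1) * np.linalg.norm(n2))
    # Clip guards against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

# A frontal plane versus a plane slanted by 15 degrees about the Y axis.
theta = angle_between_planes(0.0, 0.0, np.tan(np.radians(15.0)), 0.0)
```

Comparing angles between normals sidesteps the need for an absolute ground-truth orientation, which is the point of this experiment.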


5.5.  Robustness  Analysis 

Equations (5.2) will finally result in a linear 3 × 3 system in the unknowns $p$, $q$ and $c$, the parameters of the plane in view. It might happen, however, that this system is ill-conditioned. Let’s denote it by $A\mathbf{x} = \mathbf{c}$, where $\mathbf{x} = (p\ q\ c)^T$, $A = (a_{ij})$ is a 3 × 3 matrix and $\mathbf{c}$ is a 1 × 3 vector; their components are expressions of spatiotemporal derivatives of the image intensity functions. The fact that we just have to solve a simple linear system does not mean that its solution will be necessarily stable. Since there is discretization error as well as a slight error in the estimation of the known motion, there is some uncertainty in the entries of the matrix $A$ and the vector $\mathbf{c}$, the true values not being known exactly. If the true system corresponding to the above is $A^* \mathbf{x}^* = \mathbf{c}^*$, let’s suppose that the coefficients and the right hand constants of the true system of equations are known no more precisely than is given by

$$a^*_{ij} \in [a_{ij} - \epsilon_{ij},\ a_{ij} + \epsilon_{ij}]$$

and

$$c^*_i \in [c_i - \epsilon_i,\ c_i + \epsilon_i]$$

where the quantities $\epsilon_{ij}$, $\epsilon_i$ are nonnegative. We call these quantities uncertainties. If there exist values of the coefficients $a_{ij}$ in the intervals of uncertainty for which the determinant of the system becomes zero, then the system is very badly conditioned and it shouldn’t be solved, because whatever the solution is, it will be very unreliable. In the sequel we will develop a condition for this kind of ill-conditioning to occur. Let’s first write the true system of equations $A^* \mathbf{x}^* = \mathbf{c}^*$ as

$$(A + \delta A)(\mathbf{x} + \delta \mathbf{x}) = \mathbf{c} + \delta \mathbf{c},$$

where $A$, $\mathbf{x}$ and $\mathbf{c}$ refer to the approximate system of equations. Let’s also write $\Delta A = (\epsilon_{ij})$ and $|\delta A| = (|\delta a_{ij}|)$, where $\delta A = (\delta a_{ij})$. Then in order to have stability it is necessary and sufficient that $A + \delta A$ be nonsingular for changes in the coefficients within the limits of the uncertainties, i.e. we must have

$$\det(A + \delta A) \ne 0 \qquad (5.3)$$

whenever

$$|\delta A| \le \Delta A,$$

i.e. whenever $|\delta a_{ij}| \le \epsilon_{ij}$. But the determinant is a function of the coefficients. So, writing $a = \det(A)$ and $a + \delta a = \det(A + \delta A)$, the change $\delta a$ in the determinant $a$ is given to the first order of small quantities by

$$\delta a = \sum_{i=1}^{3} \sum_{j=1}^{3} \frac{\partial a}{\partial a_{ij}}\, \delta a_{ij}. \qquad (5.4)$$

Then, taking the modulus we have

$$|\delta a| \le \sum_{i=1}^{3} \sum_{j=1}^{3} \left| \frac{\partial a}{\partial a_{ij}} \right| |\delta a_{ij}|,$$

and the equality is reached in the above equation when $\partial a / \partial a_{ij}$ and $\delta a_{ij}$ are always of the same sign.

Let’s further denote by $\Delta a$ the least upper bound of the absolute change in value of the determinant within the limits of the uncertainties in the coefficients. Thus, within the limits a change in value of the determinant as large as $\Delta a$ in absolute value is possible, but a larger value is not possible. So,

$$\Delta a = \sum_{i=1}^{3} \sum_{j=1}^{3} \left| \frac{\partial a}{\partial a_{ij}} \right| \epsilon_{ij}.$$

The expression on the right hand side of the above equation is an upper bound of $|\delta a|$ for $|\delta a_{ij}| \le \epsilon_{ij}$, and this expression is the least upper bound. Now, instability (critical ill-conditioning) means that the determinant of the coefficient matrix can become zero within the limits of the uncertainties in the coefficients. But the true value of the determinant, from the preceding analysis, is contained in the interval $[a - \Delta a,\ a + \Delta a]$, and so the necessary and sufficient condition for no critical ill-conditioning is that this interval does not contain zero, i.e. $\Delta a < |a|$, whatever the sign of $a$. So, the necessary and sufficient condition for no critical ill-conditioning is

$$\Delta a < |a|$$

or

$$\frac{1}{|a|} \sum_{i=1}^{3} \sum_{j=1}^{3} \left| \frac{\partial a}{\partial a_{ij}} \right| \epsilon_{ij} < 1.$$

On the other hand, we have

$$\frac{\partial a}{\partial a_{ij}} = (-1)^{i+j} M_{ij}$$

for $i, j = 1, \ldots, 3$, where $M_{ij}$ are the minors of the determinant of the matrix $A$. But it is trivial to see that $(-1)^{i+j} M_{ij} / a$ are the elements of the transpose of the inverse of $A$, i.e.

$$b_{ji} = \frac{(-1)^{i+j} M_{ij}}{a}, \qquad i, j = 1, 2, 3,$$

where $B = (b_{ij}) = A^{-1}$.

Hence, from this analysis, the necessary and sufficient condition for the system not to be critically ill-conditioned (and therefore for its solution to be stable) is

$$\sum_{i=1}^{3} \sum_{j=1}^{3} |b_{ji}|\, \epsilon_{ij} < 1. \qquad (5.5)$$

It  has  to  be  noted,  however,  that  this  condition  is  approxi¬ 
mate,  since  we  assumed  that  (5.4)  is  true  (first  order);  but  this 
approximation  is  not  very  far  from  reality,  since  the  deter¬ 
minant  is  a  smooth  function  of  the  elements  of  the  matrix. 
We  now  proceed  to  apply  the  findings  of  this  section  to  our 
problem.  A  theoretical  error  analysis  seems  very  hard,  since 
we  would  need  specific  distributions  for  the  image  intensity 
function.  However,  we  can  examine  if  (5.5)  becomes  true  for 
several  instances  of  our  problem. 

Introducing discretization error (i.e. retinal coordinates x, y have uncertainties of at most 1/2 pixel), allowing small errors (up to 20%) in the known motion parameters, and generating random intensity images, we found from simulation that condition (5.5) was always true, i.e. we never had critical ill-conditioning, as the previous experimental results also indicated. However, a theoretical error analysis is still needed.
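Condition (5.5) is straightforward to check numerically. The sketch below assumes the coefficient matrix `A` and a matrix (or scalar) of uncertainties `eps` are available; the function name and the sample matrices are illustrative choices, not values taken from the simulations.

```python
import numpy as np

def is_critically_ill_conditioned(A, eps):
    """First-order test of (5.5): True when sum_ij |b_ji| * eps_ij >= 1, with B = inv(A).

    A   : square coefficient matrix of the approximate system
    eps : nonnegative uncertainties eps_ij (a scalar is broadcast to all entries)
    """
    A = np.asarray(A, dtype=float)
    eps = np.broadcast_to(np.asarray(eps, dtype=float), A.shape)
    B = np.linalg.inv(A)                 # raises LinAlgError if A itself is singular
    # B.T[i, j] equals b_ji, which multiplies eps_ij in condition (5.5).
    return bool(np.sum(np.abs(B.T) * eps) >= 1.0)

# Identity matrix with uncertainty 0.1 per entry: the sum is 0.3, so the system is safe.
well = is_critically_ill_conditioned(np.eye(3), 0.1)
# One nearly vanishing pivot with uncertainty 0.02 per entry: the sum is 2.04.
bad = is_critically_ill_conditioned(np.diag([1.0, 1.0, 0.01]), 0.02)
```

Since the test rests on the first-order expansion (5.4), it should be read as an approximate screening step rather than an exact guarantee.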


5.6.  Summary  and  Discussion 

We  have  presented  a  method  for  the  recovery  of  shape. 
Our  method  does  not  rely  on  any  assumptions  about  the  tex¬ 
ture  and  it  does  not  require  the  image  to  be  spatially 
differentiable.  The  approach  is  based  on  the  fact  that  the 
observer is moving with a known motion. If the observer is moving with an unknown motion, the problem is again solvable; it has been addressed by Aloimonos [19??] and Negahdaripour [1985].

6.  SHAPE  AND  3-D  MOTION 
FROM  CONTOUR  AND  STEREO 

In  this  section  we  study  the  detection  of  surface  shape 
and  three-dimensional  motion  from  the  perception  of  a  planar 
contour.  We  prove  that  a  binocular  observer  can  compute 
the  orientation  and  the  3-D  motion  of  a  moving  contour 
without  using  point  to  point  correspondences.  In  particular: 


(1)  We  develop  constraints  between  the  coordinates  of  the 
points  that  constitute  the  contours  in  the  left  and  right 
retina  of  a  binocular  observer  that  enable  him  to  detect 
the  structure  and  depth  of  the  plane  in  view  without 
using  any  point  to  point  correspondences. 

(2)  We  develop  constraints  between  the  lengths  and  the 
areas  of  the  contours  in  the  left  and  the  right  retina  of  a 
binocular  observer  that  enable  him  to  compute  the  struc¬ 
ture  and  the  depth  of  the  plane  in  view  without  any 
point  correspondences. 

(3)  We  discover  constraints  between  the  retinal  motion  of 
the  contour  and  its  three-dimensional  motion  that  make 
it  possible  to  recover  3-D  motion  without  any  correspon¬ 
dences. 

(4)  And  finally  we  generalize  some  of  the  above  results  for  a 
monocular  observer.  In  particular,  a  translating  mono¬ 
cular  observer  can  recover  the  shape  of  an  imaged  con¬ 
tour  without  using  any  point  to  point  correspondences. 

The  basic  assumption  here  is  that  the  contours  in  the 
left  and  right  images  have  been  found  and  the  correspondence 
between  them  has  been  established. 

6.1. Introduction

The  human  perceiver  is  able  to  derive  enormous  amounts 
of  information  from  the  contours  in  a  scene.  As  part  of  this 
capacity,  we  are  able  to  use  the  shapes  of  image  contours  (as 
they  are  seen  by  both  eyes)  to  infer  the  shapes  and  disposi¬ 
tions  in  space  of  the  surfaces  they  lie  on,  as  well  as  their 
motion.  To  the  extent  the  inferences  we  draw  are  accurate, 
our  strategies  for  drawing  them  must  have  some  basis  in  the 
character  of  the  visual  world,  just  as  the  efficacy  of  stereopsis 
as  a  source  for  depth  information  has  a  basis  in  the  geometry 
of projection and triangulation. The aim of the research described here is (1) to discover constraints on the visual world that allow surface shape and motion to be reliably inferred from contours in images, and (2) to derive methods of inference from these constraints. Following [Witkin, 1981], the interpretation of contours by a binocular observer falls into four subproblems:

a)  Locating  contours  in  the  images. 

If  contours  are  to  be  used  to  infer  anything,  they  must 
be  found.  The  human  perceiver  has  little  difficulty  deciding 
what  is  and  is  not  a  contour,  yet  the  automatic  detection  of 
edges  has  proved  very  difficult.  Perhaps  this  fact  should  not 
be  surprising;  the  contours  that  we  see  in  natural  images  usu¬ 
ally  correspond  to  definite  physical  events,  such  as  shadows, 
depth  discontinuities,  color  differences  and  the  like.  Our  abil¬ 
ity  to  detect  these  events  may  say  more  about  their 
significance  for  image  interpretation  than  about  their  ease  of 
detection.  Why  should  we  expect  events  that  have  simple 
descriptions  in  terms  of  the  structure  of  the  scene  to  have 
simple  descriptions  in  terms  of  the  image  intensity  as  well?  If 
the  physical  significance  of  contours  is  taken  as  their  primary 
feature,  then  at  least  we  know  what  is  being  detected,  even  if 
we don’t know how. Recent research, however, suggests that contour detection is in a fairly good state: the contours in an image can be detected reasonably well, even if with some inaccuracies.

b) Labeling contours (i.e. distinguishing contours which are due to different physical events)

If contours correspond to different physical events, then
an  essential  component  of  their  interpretation  must  be  to 
decide  which  contours  denote  which  event,  since  each  kind  of 
contour  imparts  a  different  meaning.  Recent  work  has  shown 
that  strong  structural  constraints  can  be  applied  to  distin¬ 
guish  one  kind  of  contour  from  another. 

c) Corresponding contours (i.e. finding which contours in the left and right images are the image of the same 3-D contour)

Before  we  apply  some  interpretation  method  to  the 
images  of  the  contours  (left  and  right),  we  should  know  which 
contours  in  both  images  correspond  to  each  other,  i.e.  they 
are  the  images  of  the  same  three-dimensional  contour. 

d)  Interpreting  contours 

Even after contours have been found and labeled, and the corresponding ones in the left and right images have been identified, not much is known about the physical structure of the scene if we don’t wish to resort to a point-to-point correspondence between the left and right images. It is clear
that  contours  play  an  important  role  in  the  human  perceiver’s 
ability  to  decide  how  things  are  shaped  and  where  they  are, 
apart  from  the  application  of  specific  “higher  level” 
knowledge  to  objects  of  known  shape.  This  research 
addresses  this  fourth  problem,  i.e.,  given  the  left  and  right 
image  of  a  moving  planar  3-D  contour,  to  recover  its  orienta¬ 
tion,  depth  and  3-D  motion,  without  using  any  point-to-point 
correspondence,  neither  between  the  left  and  right  images  nor 
between  the  dynamic  frames.  The  reason  that  we  want  to 
solve  the  problem  without  using  point  correspondences  is  that 
correspondence  is  a  very  hard  problem  and  it  does  not  seem 
tractable  with  the  available  tools.  So,  we  would  like  to 
address  the  problem  in  such  a  way  that  we  avoid  the 
correspondence  problem. 

6.2.  Motivation 

This  research  is  motivated  by  the  inherent  difficulties  of 
the  conventional  stereo  problems  as  well  as  the  difficulty  of 
the  dynamic  correspondence  problem  (to  recover  optic  flow  or 
discrete  displacements,  that  will  be  used  for  the  recovery  of 
3-D  motion). 

Passive  ranging  by  triangulation  methods,  which  is 
employed  successfully  by  humans  under  certain  conditions, 
has  received  much  attention  in  the  computer  vision  literature 
in  recent  years.  It  is  obvious  that  the  ability  to  recover  abso¬ 
lute  range  to  objects  in  a  scene  would  be  important  in  a 
variety  of  robotic  applications.  To  date,  only  two  basic 
methods  of  passive  ranging  have  been  reported,  “static 
stereo,”  i.e.  the  use  of  two  cameras  separated  by  a  known 
baseline,  and  “motion  stereo,”  i.e.  the  use  of  a  single  camera 
moving  in  a  known  way  through  a  stationary  scene. 
Recently,  a  new  concept  has  been  introduced  for  passive  rang¬ 
ing  to  moving  objects,  termed  “dynamic  stereo,”  which  is 
based  on  the  comparison  of  multiple  image  flows.  In  the 
sequel,  we  will  only  deal  with  the  criticism  of  the  first  method 
(static  stereo).  Most  of  the  literature  on  passive  ranging  has 
been  concerned  with  the  difficult  “correspondence”  problem 
associated  with  the  assignment  of  stereo  disparities  (for  the 
static stereo method). Besides the traditional method of intensity correlation between images, much attention has been paid to the theory of Marr and Poggio [1977, 1979], with implementation by Grimson. The use of more than two camera
locations,  to  aid  in  solving  the  correspondence  between 


images,  has  been  approached  in  different  ways  by  some 
researchers.  Nevertheless,  the  solution  of  this  correspondence 
problem remains a computationally expensive and slow process, with only partial success over a variety of input images. Moreover, a maximum ranging distance is implied by the finite
resolution  of  the  cameras  and  the  statically  configured  base¬ 
line  between  cameras.  Most  of  the  work  needed  to  solve  the 
correspondence  problem  deals  with  the  matching  of 
microfeatures, such as points of interest (corners, high curvature points) and edges. A natural question that arises, then, is: Is it possible to recover structure and depth, given that we have matched a macrofeature (i.e., a planar contour) instead of a microfeature? We prove that it is. Of course in this
study  we  don’t  deal  with  “how  to  match  the  planar  contours 
in  the  two  stereo  frames,”  i.e.,  to  find  in  both  images  the  con¬ 
tours  which  are  due  to  the  projection  of  the  same  three- 
dimensional  planar  contour  (size,  color,  texture,  fractal  dimen¬ 
sion  could  be  used  for  the  solution  of  this  problem). 

We also show that it is possible to solve the 3-D motion determination problem without using point-to-point correspondence when the imaged object is a planar contour; i.e., a binocular observer can determine the 3-D motion of a contour from two temporally close positions of the contour, without using any point-to-point correspondence. Of course there are still difficulties with this new approach, and the inherent problems of dynamic imagery appear in another form, different from the one in the traditional methods (camera ⇒ retinal motion ⇒ 3-D motion); but it turns out that these problems, in the presence of small noise percentages, hardly affect the results.

The  organization  of  the  section  is  as  follows.  Section  6.3 
describes  previous  work.  Section  6.4  introduces  the  concept 
of  “aggregate  stereo,”  a  method  that  computes  the  structure 
and  depth  of  a  3-D  planar  contour  from  its  images  on  the  left 
and right flat retinas. Section 6.5 introduces new constraints
for  the  stereo  problem,  which  are  not  based  on  triangulation, 
but  on  the  change  of  area  and  perimeter  in  the  left  and  right 
images  of  the  contour.  Section  6.6  introduces  the  concept  of 
determining  the  direction  of  the  translation  of  a  translating 
planar  contour,  without  using  any  point-to-point  correspon¬ 
dence,  and  introduces  the  reader  to  Section  6.7  which  deals 
with  the  solution  of  the  general  problem  (the  case  where  the 
3-D  planar  contour  is  translating  and  rotating). 

In  what  follows,  because  of  the  discrete  nature  of  images, 
we  will  consider  a  contour  either  as  a  collection  of  points 
(which it actually is) or as a continuous curve, when needed
to establish the mathematical rigor of a proof.

6.3.  Previous  Work 

The  idea  of  using  more  than  one  camera  to  recover  the 
shape  of  a  contour  seems  to  be  new. 

The  recovery  of  three-dimensional  shape  and  surface 
orientation  from  a  two-dimensional  contour  is  a  fundamental 
process  in  any  visual  system.  Recently,  a  number  of  methods 
have  been  proposed  for  computing  this  shape  from  contour. 
For  the  most  part,  previous  techniques  have  concentrated  on 
trying  to  identify  a  few  simple,  general  constraints  and 
assumptions  that  are  consistent  with  the  nature  of  all  possible 
objects  and  imaging  geometries  in  order  to  recover  a  single 
“best”  interpretation,  from  among  the  many  possible  for  a 
given image. For example, Kanade [1981] defines shape
constraints in terms of image space regularities such as parallel
lines and skew symmetries under orthographic projection.
Witkin [1981] looks for the most uniform distribution of
tangents to a contour over a set of possible inverse projections
in object space under orthography. Similarly, Brady and
Yuille [1984] search for the most compact shape (using the
measure of area over perimeter squared) in the object space of
inverse projected planar contours.

Rather than attempting to maximize some general
shape-based evaluation function over the space of possible
inverse projective transforms of a given image contour, and
in keeping with our framework of seeking unique solutions
without restrictive assumptions or heuristics, we propose to
find a unique solution by using more than one camera. It can
easily be proved that a single image (under orthography or
perspective) of a planar contour admits infinitely many
interpretations of the structure of the world plane on which
the contour lies, if no other information is known. The need
for a unique solution, which is guaranteed in our approach,
also comes from the fact that there exist many real-world
counterexamples to the evaluation functions that have been
developed to date. For example, Kanade's and Witkin's measures
incorrectly estimate surface orientation for regular shapes such
as ellipses (which are often interpreted as slanted circles).
Brady's compactness measure does not correctly interpret
non-compact figures such as rectangles, since it computes them
to be rotated squares (e.g., if we view a rectangular table top,
we do not see it as a rotated square surface, but as a rotated
rectangle).

Finally, the need for the solution of the 3-D motion
parameter determination problem without using point-to-point
correspondence for the case of the planar contour has
recently been appreciated by Kanatani [1985], and has led to
methods of great mathematical elegance. The methods that
we will propose in the following sections are quite intuitive
and can be considered immune to small noise percentages.

6.4.  Aggregate  Stereo 

In  this  section  we  present  a  theory  for  the  recovery  of 
the  three-dimensional  parameters  of  a  planar  contour,  from 
its  left  and  right  images,  without  using  any  point-to-point 
correspondence.  Instead,  we  consider  all  the  point  correspon¬ 
dences  at  once;  thus,  there  is  no  need  for  the  solution  of  the 
correspondence  problem  of  points.  Correspondence  of  the 
contours  as  a  whole  is  required.  We  consider  a  contour  to  be 
discrete,  i.e.  a  collection  of  points. 

Let a coordinate system OXYZ be fixed with respect to
the left camera, with the Z axis pointing along the optical
axis. We consider the image contours to be collections of
points, $C_l = \{(x_{li}, y_{li}) \mid i = 1, \ldots, n\}$ and
$C_r = \{(x_{ri}, y_{ri}) \mid i = 1, \ldots, n\}$. Consider a point
$(X_i, Y_i, Z_i)$ on the world plane and its projections
$(x_{li}, y_{li})$, $(x_{ri}, y_{ri})$ on the left and right image
frames respectively. Then

$$x_{li} - x_{ri} = \frac{fd}{Z_i} \qquad (6.1)$$

$$y_{li} = y_{ri} \qquad (6.2)$$

with $f$ the focal length and $d$ the distance separating the two
cameras (see Fig. 6). We proceed with the following propositions.

Proposition 1: Under the established nomenclature, the
quantity

$$\sum_{i=1}^{n} \frac{y_{li}^{k}}{Z_i}$$

is directly computable (no correspondence is required).


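Proof: We have

$$\sum_{i=1}^{n} \frac{y_{li}^{k}}{Z_i} = \sum_{i=1}^{n} \frac{y_{li}^{k}\,(x_{li} - x_{ri})}{fd} \quad \text{(from equation (6.1))}$$

$$= \sum_{i=1}^{n} \frac{x_{li}\, y_{li}^{k}}{fd} - \sum_{i=1}^{n} \frac{x_{ri}\, y_{li}^{k}}{fd} = \sum_{i=1}^{n} \frac{x_{li}\, y_{li}^{k}}{fd} - \sum_{i=1}^{n} \frac{x_{ri}\, y_{ri}^{k}}{fd} \quad \text{(from equation (6.2))}$$

that is,

$$\sum_{i=1}^{n} \frac{y_{li}^{k}}{Z_i} = \sum_{i=1}^{n} \frac{x_{li}\, y_{li}^{k}}{fd} - \sum_{i=1}^{n} \frac{x_{ri}\, y_{ri}^{k}}{fd} \qquad (6.3)$$

Each sum on the right-hand side involves the points of one image
only, so it can be evaluated with the points taken in any order.
From equation (6.3) the claim is obvious.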


Proposition  2:  Using  the  aforementioned  nomenclature,  the 
parameters  p,q  and  c  of  the  contour  in  view  are  directly  com¬ 
putable  without  using  any  point-to-point  correspondence 
between  the  two  frames. 


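Proof: The equation of the world plane $Z = pX + qY + c$ on
which the contour lies, when expressed in terms of the
coordinates of the left frame (substituting $X = x_l Z / f$ and
$Y = y_l Z / f$), becomes

$$\frac{1}{Z} = (f - p x_l - q y_l)\,\frac{1}{cf} \qquad (6.4)$$

It follows that

$$\frac{1}{Z_i} = (f - p x_{li} - q y_{li})\,\frac{1}{cf}, \quad i = 1, 2, 3, \ldots, n \qquad (6.5)$$

Now, we have

$$\sum_{i=1}^{n} \frac{y_{li}^{k}}{Z_i} = \sum_{i=1}^{n} (f - p x_{li} - q y_{li})\,\frac{y_{li}^{k}}{cf}$$

or

$$\sum_{i=1}^{n} \frac{y_{li}^{k}}{Z_i} = \frac{1}{c} \sum_{i=1}^{n} y_{li}^{k} - \frac{p}{cf} \sum_{i=1}^{n} x_{li}\, y_{li}^{k} - \frac{q}{cf} \sum_{i=1}^{n} y_{li}^{k+1} \qquad (6.6)$$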


The  left-hand  side  of  equation  (6.6)  has  been  shown  to 
be  computable  without  using  any  point-to-point  correspon¬ 
dence  (see  Proposition  1). 

If we write equation (6.6) for three different values
$k_1$, $k_2$, $k_3$ of $k$, we obtain the following system, which is
linear in $1/c$, $p/c$ and $q/c$ (and therefore determines $p$, $q$,
$c$) and which in general has a unique solution:

$$\sum_{i=1}^{n} \frac{x_{li}\, y_{li}^{k_j}}{fd} - \sum_{i=1}^{n} \frac{x_{ri}\, y_{ri}^{k_j}}{fd} = \frac{1}{c} \sum_{i=1}^{n} y_{li}^{k_j} - \frac{p}{cf} \sum_{i=1}^{n} x_{li}\, y_{li}^{k_j} - \frac{q}{cf} \sum_{i=1}^{n} y_{li}^{k_j+1}, \quad j = 1, 2, 3 \qquad (6.7), (6.8), (6.9)$$

where we have used equation (6.3) in the left hand sides.

The  solution  of  the  above  system  recovers  the  structure 
and  the  depth  of  the  world  planar  contour  without  any 
correspondence  and  this  is  the  conclusion  of  Proposition  2. 
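As a rough illustration of Proposition 2, the sketch below builds the three equations from (6.6) with the exponents k = 0, 1, 2 and solves them for 1/c, p/c and q/c. The synthetic contour, the parameter values and all function names (`aggregate_stereo`, `synthetic_views`, `solve3`) are assumptions made for this example only; the data are noise-free, and the right image points are shuffled to stress that only per-image sums are used.

```python
import math
import random


def det3(a):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def solve3(m, b):
    """Solve the 3x3 linear system m u = b by Cramer's rule."""
    d0 = det3(m)
    sol = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = b[r]
        sol.append(det3(mc) / d0)
    return sol


def aggregate_stereo(left, right, f, d, ks=(0, 1, 2)):
    """Recover (p, q, c) of the plane Z = pX + qY + c from two unordered
    lists of image points, using equation (6.6) for each exponent in ks."""
    rows, rhs = [], []
    for k in ks:
        # Left-hand side of (6.6), computed through (6.3): per-image sums only.
        rhs.append((sum(x * y ** k for x, y in left)
                    - sum(x * y ** k for x, y in right)) / (f * d))
        # Coefficients of the unknowns (1/c, p/c, q/c) on the right of (6.6).
        rows.append([sum(y ** k for _, y in left),
                     -sum(x * y ** k for x, y in left) / f,
                     -sum(y ** (k + 1) for _, y in left) / f])
    inv_c, p_over_c, q_over_c = solve3(rows, rhs)
    return p_over_c / inv_c, q_over_c / inv_c, 1.0 / inv_c


def synthetic_views(p, q, c, f, d, n=60):
    """Hypothetical test scene: a closed planar curve seen by both cameras."""
    left, right = [], []
    for i in range(n):
        t = 2.0 * math.pi * i / n
        X = 0.5 + 3.0 * math.cos(t) + 0.8 * math.cos(2.0 * t)
        Y = 0.3 + 2.0 * math.sin(t) + 1.0 * math.cos(t)
        Z = p * X + q * Y + c
        left.append((f * X / Z, f * Y / Z))
        right.append((f * (X - d) / Z, f * Y / Z))
    random.Random(0).shuffle(right)  # destroy any point-to-point ordering
    return left, right


left, right = synthetic_views(p=0.3, q=-0.2, c=10.0, f=1.0, d=0.5)
p_hat, q_hat, c_hat = aggregate_stereo(left, right, f=1.0, d=0.5)
```

On exact data the recovered values should agree with the true plane up to rounding. With noisy data the higher exponents behave poorly, which is the subject of the practical considerations that follow.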


6.4.1.  Practical  considerations 


It is clear from the previous analysis that we have assumed
that the number of points in the left and right image contours
is the same. But in a practical situation, depending on the
orientation of the world contour and due to foreshortening
effects, the number of points in the left and right image
contours will not be the same. To account for this, we divide
the summations in equations (6.7), (6.8) and (6.9) by the total
number of points. For example, one of these equations becomes

$$\frac{c}{d} \left[ \frac{1}{n_l} \sum_{i=1}^{n_l} x_{li}\,(y_{li})^{k} - \frac{1}{n_r} \sum_{i=1}^{n_r} x_{ri}\,(y_{ri})^{k} \right] = f\,\frac{1}{n_l} \sum_{i=1}^{n_l} (y_{li})^{k} - p\,\frac{1}{n_l} \sum_{i=1}^{n_l} x_{li}\,(y_{li})^{k} - q\,\frac{1}{n_l} \sum_{i=1}^{n_l} (y_{li})^{k+1} \qquad (6.10)$$

where $n_l$, $n_r$ are the total numbers of points constituting the
left and right image contours respectively. To check the
consistency of these equations (unbiasedness, sufficiency), we
assume that the points are perturbed, i.e.

$$x_{li} = \bar{x}_{li} + \epsilon(x_{li}) \quad \text{with} \quad \epsilon(x_{li}) \sim N(0, \sigma^2)$$

where $\bar{x}_{li}$ is the unperturbed coordinate, and the same for
the rest of the coordinates, with the assumption of independence
among the stochastic variables $x_{li}$, $y_{li}$, $x_{ri}$ and
$y_{ri}$. We will prove in the sequel that equation (6.10) is
consistent if and only if $k = 0$.

Indeed, consider

$$E\left[\frac{1}{n_l} \sum x_{li}\,(y_{li})^{k}\right] = \frac{1}{n_l} \sum E(x_{li})\,E(y_{li}^{k}) = \frac{1}{n_l} \sum \bar{x}_{li}\,\bar{y}_{li}^{\,k}$$

iff $k = 0$ or $1$, since $E(y_{li}^{k}) = \bar{y}_{li}^{\,k}$ (the
noise-free value) iff $k = 0$ or $1$. But

$$E\left[\frac{1}{n_l} \sum (y_{li})^{k+1}\right] = \frac{1}{n_l} \sum (\bar{y}_{li})^{k+1}$$

iff $k = -1$ or $0$.

So, for unbiasedness it is necessary that $k = 0$.
To check sufficiency we need to check whether or not

$$\mathrm{Var}\left(\frac{1}{n_l} \sum x_{li}\right) \to 0 \quad \text{as } n_l \to \infty.$$

But

$$\mathrm{Var}\left(\frac{1}{n_l} \sum x_{li}\right) = \frac{1}{n_l^2} \sum \mathrm{Var}(x_{li}) = \frac{1}{n_l^2}\, n_l\, \sigma^2 = \frac{\sigma^2}{n_l} \to 0 \quad \text{as } n_l \to \infty.$$

Similarly for the other coefficients. Thus, in order to have a
robust method, we showed that the exponents k should be
zero, in which case all three equations (6.7), (6.8) and (6.9)
degenerate to one equation. In that case, we need to divide
the contours into three separate parts (see Figure 6.1) and
apply equation (6.10) to each one of the corresponding parts. It
can be easily shown that the resulting system of three
equations in the three unknowns p, q, c always has a solution,
unless the centers of mass of the three different areas are
collinear (Fig. 6.1).
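The collinearity condition can be made explicit with a short worked form of the three-part system. This is a sketch under the assumption that (6.10) is applied with k = 0 to part j = 1, 2, 3; the centroid symbols $\mu^{(j)}_{lx}$, $\mu^{(j)}_{ly}$ and $\mu^{(j)}_{rx}$ (mean left x, mean left y and mean right x of part j) are introduced here only for illustration.

```latex
% Equation (6.10) with k = 0, divided by c, for part j:
\frac{1}{d}\left(\mu^{(j)}_{lx} - \mu^{(j)}_{rx}\right)
  = f\,\frac{1}{c} - \mu^{(j)}_{lx}\,\frac{p}{c} - \mu^{(j)}_{ly}\,\frac{q}{c},
  \qquad j = 1, 2, 3.

% In matrix form, with unknowns (1/c, p/c, q/c):
\begin{pmatrix}
  f & -\mu^{(1)}_{lx} & -\mu^{(1)}_{ly} \\
  f & -\mu^{(2)}_{lx} & -\mu^{(2)}_{ly} \\
  f & -\mu^{(3)}_{lx} & -\mu^{(3)}_{ly}
\end{pmatrix}
\begin{pmatrix} 1/c \\ p/c \\ q/c \end{pmatrix}
= \frac{1}{d}
\begin{pmatrix}
  \mu^{(1)}_{lx} - \mu^{(1)}_{rx} \\
  \mu^{(2)}_{lx} - \mu^{(2)}_{rx} \\
  \mu^{(3)}_{lx} - \mu^{(3)}_{rx}
\end{pmatrix}.

% The determinant of the coefficient matrix is
f \,\det
\begin{pmatrix}
  1 & \mu^{(1)}_{lx} & \mu^{(1)}_{ly} \\
  1 & \mu^{(2)}_{lx} & \mu^{(2)}_{ly} \\
  1 & \mu^{(3)}_{lx} & \mu^{(3)}_{ly}
\end{pmatrix},
% i.e. f times twice the signed area of the triangle formed by the three
% left-image centroids; it vanishes exactly when they are collinear.
```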

6.5. Shape from Contour and Stereo Based on Change of Length and Area

In  this  section  we  develop  a  computational  theory  for  the 
detection  of  shape  from  contour  by  a  binocular  observer, 
without  using  correspondences,  and  utilizing  invariant  proper¬ 
ties  of  area  and  length. 

Consider a coordinate system OXYZ to be fixed with
respect to the left camera, with the Z axis again pointing
along the optical axis. We consider that the image plane of
the left camera is perpendicular to the Z axis at the point
(0,0,1). The nodal point of the right camera is the point
(d,0,0) and the image plane of the right camera is identical
to the one of the left camera. C is a contour on the world
plane $\Pi$ with equation $Z = pX + qY + c$, and $C_l$ and $C_r$ are
the projections of the contour C on the left and right image
respectively, using perspective projection (see Figure 6).

Figure 6.1: Dividing the contours into three parts.

There exist invariant quantities in this particular case,
namely properties of the 3-D contour that can be found from
either frame (left or right). As such properties, we choose
area and perimeter. In other words, the change of the area of
the contour from the left to the right frame as well as the
change in the perimeter should be connected to intrinsic
properties of the world contour, for example, its orientation. If
$S_L$ and $S_R$ are the areas of the image contours in the left and
right images respectively, then the following relation holds:

$$\frac{S_L}{S_R} = \frac{1 - A_L p - B_L q}{1 - A_R p - B_R q} \qquad (6.11)$$

where $(A_L, B_L)$, $(A_R, B_R)$ are the centers of mass of the left
and right image contours respectively and the focal length is
unity. For the proof of (6.11) see Appendix 2.

Clearly, (6.11) is a linear equation in the unknowns p, q,
i.e. the gradient of the world plane. On the other hand, there
is information from the perimeters of the image contours that
we have not yet used. In particular, if we compute the
perimeter of the world contour from the left image or from the
right image, the result should be the same. From this fact we
can derive one more constraint and thus solve for the
orientation of the surface in view.

For that, we need to develop the first fundamental form
of the world plane as a function of the retinal coordinates, in
order to be able to compute the length of the world contour
(up to a constant factor, of course). If we fix a coordinate
system OXYZ with the Z axis as the optical axis and focal
length $F = 1$ and we consider a plane $\Pi$: $Z = pX + qY + c$
in the world with a contour C on it, and we denote by $(x, y)$
the coordinates on the image plane, then a point $(X, Y, Z)$ in
the world planar contour C is projected onto the point

$$(x, y) = \left(\frac{FX}{Z}, \frac{FY}{Z}\right).$$

The inverse imaging function, call it $f$, is the function
that maps the image plane onto the world plane; so, if $(x, y)$ is
an image point, the 3-D world point in the plane
$Z = pX + qY + c$ that has $(x, y)$ as its image is given by

$$f(x, y) = \left(\frac{cx}{F - px - qy},\ \frac{cy}{F - px - qy},\ \frac{cF}{F - px - qy}\right).$$

The first fundamental form of $f$ [Lipschutz, 1969] is the
quadratic form

$$E\,dx^2 + 2F\,dx\,dy + G\,dy^2$$

with $E = f_x \cdot f_x$, $F = f_x \cdot f_y$ and $G = f_y \cdot f_y$
(here $F$ denotes the middle coefficient of the form, not the
focal length, which is unity), that is,

$$E = \frac{c^2\left[(1 - qy)^2 + p^2 y^2 + p^2\right]}{(1 - px - qy)^4},$$

$$F = \frac{c^2\left[(1 - qy)\,qx + (1 - px)\,py + pq\right]}{(1 - px - qy)^4},$$

$$G = \frac{c^2\left[(1 - px)^2 + q^2 x^2 + q^2\right]}{(1 - px - qy)^4}.$$

If we consider two points $(x, y)$ and $(x + dx, y + dy)$ on the
image plane, then the 3-D distance $dC$ of the corresponding
points on the world plane is given by

$$dC = \sqrt{E\,dx^2 + 2F\,dx\,dy + G\,dy^2}.$$

Consequently, if we have a contour C on the image plane,
then the 3-D planar contour has perimeter length

$$\int_C \sqrt{E\,dx^2 + 2F\,dx\,dy + G\,dy^2} \qquad (6.12)$$

with the integral taken along the contour on the image plane.
Expression (6.12) can be used to compute the length of the
world contour from either image frame. Let $L_l$ and $L_r$ be the
expressions for the length of the 3-D contour as computed
from the left and right image frames respectively. Thus,

$$L_l - L_r = 0 \qquad (6.13)$$

represents a nonlinear constraint on the parameters of the
world plane. It has to be emphasized that equation (6.13)
involves as unknowns only the components of the gradient
(p,q) of the plane in view; the constant c is eliminated using
an approximation of the perspective projection, called
paraperspective [Aloimonos, 1986]. The system of equations (6.11)
and (6.13) gives a finite number of solutions for the
orientation of the contour (p,q).
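The two constraints can be checked numerically on a synthetic planar polygon. The sketch below is only an illustration under stated assumptions: the centers of mass in (6.11) are taken to be the area centroids of the image polygons, the right camera sits at (d, 0, 0) with unit focal length, and the 3-D length is obtained by back-projecting the vertices with a known c instead of the paraperspective elimination mentioned in the text. The polygon and all function names are hypothetical.

```python
import math


def area_and_centroid(poly):
    """Shoelace area and area centroid of a simple polygon (ordered vertices)."""
    a = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(poly, poly[1:] + poly[:1]):
        w = x0 * y1 - x1 * y0
        a += w
        cx += (x0 + x1) * w
        cy += (y0 + y1) * w
    a *= 0.5  # signed area
    return abs(a), cx / (6.0 * a), cy / (6.0 * a)


def back_project(x, y, p, q, c):
    """Inverse imaging function for the plane Z = pX + qY + c, focal length 1."""
    D = 1.0 - p * x - q * y
    return (c * x / D, c * y / D, c / D)


def world_perimeter(poly, p, q, c):
    """Polygonal version of (6.12): 3-D length of the back-projected contour."""
    pts = [back_project(x, y, p, q, c) for x, y in poly]
    return sum(math.dist(u, v) for u, v in zip(pts, pts[1:] + pts[:1]))


# Hypothetical scene: a convex pentagon on the plane Z = pX + qY + c.
p, q, c, d = 0.4, -0.3, 8.0, 1.0
world_xy = [(-1.0, -1.0), (2.0, -0.5), (2.5, 1.5), (0.5, 2.5), (-1.5, 1.0)]
world = [(X, Y, p * X + q * Y + c) for X, Y in world_xy]
left = [(X / Z, Y / Z) for X, Y, Z in world]
right = [((X - d) / Z, Y / Z) for X, Y, Z in world]

# Area constraint (6.11).
S_L, A_L, B_L = area_and_centroid(left)
S_R, A_R, B_R = area_and_centroid(right)
area_ratio = S_L / S_R
predicted_ratio = (1.0 - A_L * p - B_L * q) / (1.0 - A_R * p - B_R * q)

# Length constraint (6.13). In right-camera coordinates the same plane reads
# Z = pX + qY + (c + p*d), so only the constant term changes.
L_l = world_perimeter(left, p, q, c)
L_r = world_perimeter(right, p, q, c + p * d)
true_perimeter = sum(math.dist(u, v) for u, v in zip(world, world[1:] + world[:1]))
```

With the true gradient (p, q) both sides of (6.11) agree and the two lengths coincide; for a wrong gradient they generally do not, which is what the search described in the next subsection exploits.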

6.5.1.  Practical  considerations 

A closed form solution for the system of equations (6.11)
and (6.13) appears very hard, if not impossible, since no
specific contours are assumed. Changing the formulation for
surface orientation from gradient space to the Gaussian
sphere [Horn, 1986], we need to compute the azimuth $\theta$ and
elevation $\phi$ of the planar surface in view. Equation (6.11),
when transformed to the coordinates of the Gaussian sphere,
represents a great circle G. To solve the problem, we need to
find which point of that great circle satisfies equation (6.13).
In other words, in a discretized Gaussian sphere we
need to find the point $(\theta, \phi) \in G$ that minimizes the function
$(L_l - L_r)^2$. Multiple solutions are possible, even though our
experiments showed unique solutions for the examples
considered. Figure 6.1.1 shows the constraints from area and
length drawn on the Gaussian sphere. It was recently shown
[Aloimonos and Hervé, 1987] that there can be at most two
solutions, and criteria for checking multiplicity of the
solutions have been developed.

6.6.  Determining  the  Direction  of  Translation 

Here  we  only  treat  the  case  of  pure  translation.  The 
general  case  is  treated  in  the  next  section.  The  treatment  in 
this  section  presumes  perspective  projection. 

Consider a coordinate system OXYZ fixed with respect
to the camera, with O the nodal point of the eye and the image
plane perpendicular to the Z axis (focal length 1), which is
pointing along the optical axis. Let us represent points on the
image plane with small letters $((x, y))$ and points in the world
with capital letters $((X, Y, Z))$. (See Figure 6.2.)

Let a point $P = (X_1, Y_1, Z_1)$ in the world have perspective
image $(x_1, y_1)$ where $x_1 = X_1/Z_1$ and $y_1 = Y_1/Z_1$. Let
the point P move to the position $P' = (X_2, Y_2, Z_2)$ with

$$X_2 = X_1 + \Delta X$$
$$Y_2 = Y_1 + \Delta Y$$
$$Z_2 = Z_1 + \Delta Z.$$

Then we desire to find the direction of the translation
$(\Delta X/\Delta Z, \Delta Y/\Delta Z)$. If the image of $P'$ is $(x_2, y_2)$ then the
observed motion of the world point in the image plane is
given by the displacement vector $(x_2 - x_1, y_2 - y_1)$ (which in
the case of very small motion is also known as optic flow).

Figure 6.1.1: Length and area constraints.

We can easily prove that

$$x_2 - x_1 = \frac{\Delta X - x_1 \Delta Z}{Z_1 + \Delta Z}$$

$$y_2 - y_1 = \frac{\Delta Y - y_1 \Delta Z}{Z_1 + \Delta Z}$$

Under the assumption that the depth is large (and the
motion in depth small), the equations above become

$$x_2 - x_1 = \frac{\Delta X - x_1 \Delta Z}{Z_1} \qquad (6.14)$$

$$y_2 - y_1 = \frac{\Delta Y - y_1 \Delta Z}{Z_1} \qquad (6.15)$$

Traditional methods for recovering the direction of translation
$(\Delta X/\Delta Z, \Delta Y/\Delta Z)$ are based on equations (6.14) and
(6.15) (see [Ullman, 1979; Longuet-Higgins, 1981; Tsai and
Huang, 1984]), which of course require the knowledge of the
correspondence between points in the successive frames. In
the sequel, we present a method for the recovery of the
translational direction $(\Delta X/\Delta Z, \Delta Y/\Delta Z)$ of a moving planar
contour, without having to solve the point-to-point
correspondence problem.

Consider again a coordinate system OXYZ fixed with
respect to the camera as in Figure 6.2 and let
$A = \{(X_i, Y_i, Z_i) \mid i = 1, 2, 3, \ldots, n\}$, such that

$$Z_i = pX_i + qY_i + c, \quad i = 1, 2, 3, \ldots, n$$

that is, the points are planar. Let the points translate rigidly
with translation $(\Delta X, \Delta Y, \Delta Z)$, and let
$\{(x_i, y_i) \mid i = 1, 2, 3, \ldots, n\}$ and
$\{(x_i', y_i') \mid i = 1, 2, 3, \ldots, n\}$ be the projections
of the set A before and after the translation, respectively.

Figure 6.2: Motion of a point.

Consider a point $(x_i, y_i)$ in the first frame which has a
corresponding one $(x_i', y_i')$ in the second (dynamic) frame. For
the moment we do not worry about where the point $(x_i', y_i')$ is,
but we do know that the following relations hold between
these two points:

$$x_i' - x_i = \frac{\Delta X - x_i \Delta Z}{Z_i} \qquad (6.16)$$

$$y_i' - y_i = \frac{\Delta Y - y_i \Delta Z}{Z_i} \qquad (6.17)$$

where $Z_i$ is the depth of the 3-D point whose projection (on
the first dynamic frame) is the point $(x_i, y_i)$. Taking now into
account that

$$\frac{1}{Z_i} = \frac{1 - p x_i - q y_i}{c} \qquad (6.18)$$

the above equations become

$$x_i' - x_i = (\Delta X - x_i \Delta Z)\,\frac{1 - p x_i - q y_i}{c} \qquad (6.19)$$

$$y_i' - y_i = (\Delta Y - y_i \Delta Z)\,\frac{1 - p x_i - q y_i}{c} \qquad (6.20)$$

If we now write equation (6.19) for all the points in the
two dynamic frames and sum the resulting equations up, we
get

$$\sum_{i=1}^{n} x_i' - \sum_{i=1}^{n} x_i = \frac{1}{c} \sum_{i=1}^{n} (\Delta X - x_i \Delta Z)(1 - p x_i - q y_i) \qquad (6.21)$$

Similarly, if we do the same for equation (6.20), we get

$$\sum_{i=1}^{n} y_i' - \sum_{i=1}^{n} y_i = \frac{1}{c} \sum_{i=1}^{n} (\Delta Y - y_i \Delta Z)(1 - p x_i - q y_i) \qquad (6.22)$$

At this point it has to be understood that equations
(6.21) and (6.22) do not require our finding any
correspondence.

By dividing equation (6.21) by equation (6.22), we get

$$\frac{\sum x_i' - \sum x_i}{\sum y_i' - \sum y_i} = \frac{\frac{\Delta X}{\Delta Z} \sum (1 - p x_i - q y_i) - \sum x_i (1 - p x_i - q y_i)}{\frac{\Delta Y}{\Delta Z} \sum (1 - p x_i - q y_i) - \sum y_i (1 - p x_i - q y_i)} \qquad (6.23)$$

Equation (6.23) is a linear equation in the unknowns
$\Delta X/\Delta Z$, $\Delta Y/\Delta Z$ and the coefficients consist of expressions
involving summations of point coordinates in both dynamic
frames; for the computation of the latter no establishment of
any point correspondences is required.

So, if we consider a binocular observer, applying the
above procedure in both left and right “eyes”, we get two
linear equations (of the form of equation (6.23)) in the two
unknowns $\Delta X/\Delta Z$, $\Delta Y/\Delta Z$, which constitute a linear
system that in general has a unique solution.
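The binocular procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the gradient (p, q) is already known (for instance from the stereo analysis), writes each eye's constraint as the ratio of the summed x-displacements to the summed y-displacements cross-multiplied into a linear equation, and uses a hypothetical synthetic contour with unit focal length and right camera at (d, 0, 0). The after-motion points are shuffled, since only sums over each frame enter.

```python
import math
import random


def direction_equation(before, after, p, q):
    """Coefficients (A, B, C) of A*a + B*b = C, with a = dX/dZ and b = dY/dZ,
    obtained from one eye by cross-multiplying the ratio of summed motions."""
    dx = sum(x for x, _ in after) - sum(x for x, _ in before)
    dy = sum(y for _, y in after) - sum(y for _, y in before)
    w = [1.0 - p * x - q * y for x, y in before]  # proportional to 1/Z_i
    s0 = sum(w)
    sx = sum(x * wi for (x, _), wi in zip(before, w))
    sy = sum(y * wi for (_, y), wi in zip(before, w))
    # dx * (b*s0 - sy) = dy * (a*s0 - sx), rearranged into linear form.
    return dy * s0, -dx * s0, dy * sx - dx * sy


def translation_direction(lb, la, rb, ra, p, q):
    """Solve the two-eye 2x2 linear system for (dX/dZ, dY/dZ)."""
    a1, b1, c1 = direction_equation(lb, la, p, q)
    a2, b2, c2 = direction_equation(rb, ra, p, q)
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det


def views(points, d):
    """Left and right perspective images (focal length 1, baseline d)."""
    left = [(X / Z, Y / Z) for X, Y, Z in points]
    right = [((X - d) / Z, Y / Z) for X, Y, Z in points]
    return left, right


# Hypothetical scene: a distant planar ellipse translating by (2, -1, 1).
p, q, c, d = 0.2, 0.1, 100.0, 10.0
contour = []
for i in range(50):
    t = 2.0 * math.pi * i / 50
    X, Y = 3.0 + 10.0 * math.cos(t), -2.0 + 6.0 * math.sin(t)
    contour.append((X, Y, p * X + q * Y + c))
moved = [(X + 2.0, Y - 1.0, Z + 1.0) for X, Y, Z in contour]

lb, rb = views(contour, d)
la, ra = views(moved, d)
rng = random.Random(1)
rng.shuffle(la)  # no point-to-point ordering survives
rng.shuffle(ra)

a_hat, b_hat = translation_direction(lb, la, rb, ra, p, q)
```

Because the two eyes differ only through the small change in disparity, the 2x2 system is sensitive to noise. The example uses exact data, for which the estimate should land close to the true direction (2, -1) up to the large-depth approximation.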

6.6.1. What the previous method is not about, an
unexpected bonus and some problems

If one is not careful when analyzing the previous
method, one might think that all the method does is to
match the center of mass of the image points before the
motion with the center of mass of the image points after the
motion, and then to recover the three-dimensional motion from
that retinal motion. But this is wrong, because perspective
projection does not preserve simple ratios, and so the
center of mass of the image points before the motion does not
correspond to the center of mass of the image points after the
motion. All the above method does is aggregate the
motion constraints; it does not correspond centers of mass.

It should be noted, however, that for the method to
work the same number of points (contour periphery points) is
required in both frames (before and after the motion). If the
motion is not large, then this is true and no problem arises.

6.7.  Detecting  Unrestricted  3-D  Motion 

In  the  previous  section  we  presented  a  method  to  recover 
the  direction  of  the  translation  of  a  translating  planar  contour 
from  the  motion  of  its  left  and  right  images.  It  is  clear  that 
we  did  not  use  any  depth  information.  In  this  section  we 
present  a  method  of  recovering  the  motion  parameters  of  a 
rigidly  moving  planar  contour.  Any  rigid  motion  can  be 
represented  as  a  rotation  around  an  axis  that  we  can  freely 
choose  to  pass  through  any  point  of  our  choice,  plus  a  transla¬ 
tion.  The  problem  then  is  reduced  to  finding  the  translation 
and  the  rotation  matrix. 

So, suppose that we have four images of a moving planar
contour (left and right before the motion, left and right after
the motion). With the already presented methods, we can
recover the orientation and depth of the 3-D contour before
and after the motion; let $(p_1, q_1, c_1)$ and $(p_2, q_2, c_2)$ be the
parameters of the 3-D contour before and after the motion
respectively. But since we know the contour in 3-D, we will
do our analysis of the motion in 3-D, instead of the image
plane.

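So, let

$$C_1 = \{(X_i, Y_i, Z_i) \mid i = 1, \ldots, n\}$$

and

$$C_2 = \{(X_j', Y_j', Z_j') \mid j = 1, \ldots, m\}$$

be the two positions of the 3-D contour.

We assume that the rotation axis passes through the
center of gravity of $C_1$. This has as an immediate
consequence that the translation is given by the displacement of
the center of gravity between the two positions of the
contour. So,

$$\text{translation} = (\Delta X, \Delta Y, \Delta Z) = T = \text{center of mass of } C_2 - \text{center of mass of } C_1$$

$$= \left( \frac{\sum_j X_j'}{m} - \frac{\sum_i X_i}{n},\ \frac{\sum_j Y_j'}{m} - \frac{\sum_i Y_i}{n},\ \frac{\sum_j Z_j'}{m} - \frac{\sum_i Z_i}{n} \right) \qquad (6.24)$$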

Note that we used different points in the two
positions of the contour. Obviously, we did not need to do
this. The methods that find the 3-D position of the contour
do not address any “aperture in the large” problem. But the
3-D points are found from their projections, and discretization
effects may cause a small difference in the number of points of
the two positions of the contour. We found that equation
(6.24) gave good results, as we will see in the section on
experiments.

What remains to be found is the rotation matrix. But
since we know the surface normals $n_1 = (p_1, q_1, -1)$,
$n_2 = (p_2, q_2, -1)$ of the two positions of the contour, we can
immediately find the rotation around an axis parallel to the
plane of the contour $C_1$. Indeed, the angle $\theta$ between $n_1$ and
$n_2$,

$$\theta = \arccos \frac{n_1 \cdot n_2}{\|n_1\|\,\|n_2\|},$$

gives the rotation angle around the axis

$$l = \frac{n_1 \times n_2}{\|n_1 \times n_2\|}.$$

The angle $\theta$ along with the axis $l$ constitute a rotation matrix,
$R_1$. It is obvious that $R_1$ is not the final rotation matrix
because it misses rotation around an axis perpendicular to the
world plane. In other words, if we apply to contour $C_1$ the
rotation matrix $R_1$ and the translation T, then the result will
not be contour $C_2$ but a contour $C_1'$ which lies on the same
plane as $C_2$, and has the same center of gravity as $C_2$. To
find the missing rotation, we must find the angle by which we
have to rotate contour $C_1'$ around an axis $n$ which passes
through the center of gravity of contour $C_2$ and is
perpendicular to $C_2$.

To do that, we start rotating the contour $C_1'$ until it
coincides with contour $C_2$. This is done with small
increments and the coincidence of the two contours ($C_1'$ and $C_2$) is
signaled by the maximization of their common area. The
resulting angle $\phi$ along with the axis $n$ constitute a new
rotation matrix $R_2$. Obviously, the final rotation matrix is given
by $R = R_1 R_2$.
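The first two steps of this scheme (translation from the centers of gravity, and the tilt rotation taking one surface normal onto the other) can be sketched as below. The sketch assumes the rotation axis is the normalized cross product of the two normals and builds the matrix with the standard Rodrigues formula; the search for the in-plane rotation is not shown. All function names and the sample normals are hypothetical.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def centroid(points):
    n = len(points)
    return tuple(sum(pt[k] for pt in points) / n for k in range(3))


def translation(c1, c2):
    """Displacement of the center of gravity; the two point sets may have
    different sizes and need not correspond point to point."""
    g1, g2 = centroid(c1), centroid(c2)
    return tuple(b - a for a, b in zip(g1, g2))


def rotation_from_normals(n1, n2):
    """Rotation matrix (axis-angle, Rodrigues formula) taking the direction
    of n1 onto the direction of n2, plus the rotation angle theta."""
    u1, u2 = normalize(n1), normalize(n2)
    cos_t = max(-1.0, min(1.0, sum(a * b for a, b in zip(u1, u2))))
    theta = math.acos(cos_t)
    axis = cross(u1, u2)
    s = math.sqrt(sum(x * x for x in axis))
    if s < 1e-12:
        # Parallel normals: no tilt to recover (antiparallel case not handled).
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], theta
    kx, ky, kz = (x / s for x in axis)
    c, sn = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    R = [[c + kx * kx * v, kx * ky * v - kz * sn, kx * kz * v + ky * sn],
         [ky * kx * v + kz * sn, c + ky * ky * v, ky * kz * v - kx * sn],
         [kz * kx * v - ky * sn, kz * ky * v + kx * sn, c + kz * kz * v]]
    return R, theta


def apply(R, v):
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))


n1, n2 = (0.2, 0.1, -1.0), (-0.3, 0.4, -1.0)
R1, theta = rotation_from_normals(n1, n2)
rotated = apply(R1, normalize(n1))
T = translation([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
                [(1.0, 1.0, 1.0), (3.0, 1.0, 1.0), (2.0, 4.0, 1.0)])
```

Applying `R1` to the unit normal of the first plane should reproduce the unit normal of the second, which is the property the text relies on before searching for the remaining in-plane rotation.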

Finally, it is clear that the method described above will
not work (rotation matrix $R_2$ will not be found) for some
symmetric contours. If, for example, the 3-D contour is a
circle, matrix $R_2$ cannot be found, since $C_1'$ and $C_2$ coincide; or
if the 3-D contour is a square and the rotation angle $\phi = \pi/2$,
then again matrix $R_2$ cannot be found. This simple fact is
also obviously true for human observers who observe apparent
motion and are asked to estimate the 3-D motion parameters.
Figure 6.3 shows this scheme in detail.

6.8.  Using  a  Monocular  Observer 

The above results generalize directly to a monocular
observer who is translating with known motion.

We  proceed  now  with  the  final  section  which  describes 
experimental  results  based  on  the  previous  methods  for  the 
recovery  of  structure,  depth  and  3-D  motion  of  a  moving 
planar  contour  by  a  binocular  or  trinocular  observer. 


6.9.  Stability  Analysis 

Here  we  present  a  theoretical  stability  analysis  of  some 
representative  algorithms  that  were  previously  introduced. 
We  analyze  the  behavior  of  the  algorithm  in  Section  6.4 
(correspondenceless  stereo);  the  analysis  of  the  stability  of  the 
rest  of  the  algorithms  is  done  in  a  similar  way. 

Recalling from Section 6.4.1, equation (6.10) is used for
k = 0 in three different image zones for constructing a linear
system whose unknowns are the parameters of the plane. This
equation can finally be written as

$$p \sum_{i=1}^{n} x_{li} + q \sum_{i=1}^{n} y_{li} + r f n = \frac{1}{d} \left[ \sum_{i=1}^{n} x_{li} - \sum_{i=1}^{n} x_{ri} \right]$$

or $c_1 p + c_2 q + r f = c_3$, where

$$c_1 = \frac{1}{n} \sum_{i=1}^{n} x_{li}, \quad c_2 = \frac{1}{n} \sum_{i=1}^{n} y_{li} \quad \text{and} \quad c_3 = \frac{1}{dn} \left[ \sum_{i=1}^{n} x_{li} - \sum_{i=1}^{n} x_{ri} \right]$$

and n is the total number of points. The equation was derived
by assuming that the plane equation has the form
$pX + qY + rZ = 1$ (because if we used the other form, then we
would have to consider some special cases).


Figure 6.3: Contour before and after motion.


6.9.1.  Error  due  to  small  perturbations  of  the  points 

The following analysis captures errors due to
discretization and the uncertainty of the operators that extract
interesting points. We will analyze the effect of random noise
(in the image points) on the coefficients $c_1$, $c_2$ and $c_3$ and find
out if consistent estimators arise. In that case, robustness is
inferred. Suppose random noise $\epsilon_{x_{li}}$ is added to $x_{li}$, with
$\epsilon_{x_{li}} \sim N(0, \sigma^2)$, i.e., $E(\epsilon_{x_{li}}) = 0$ and
$\mathrm{Var}(\epsilon_{x_{li}}) = \sigma^2$. Similar noise is added to the $y_{li}$
coordinates as well as to the coordinates in the right image
frame. Also, assume that the stochastic variables corresponding
to the noise added to different components are independent.
Let $x_{li}^{o}$, $y_{li}^{o}$, etc. denote the observed $x_{li}$'s and
$y_{li}$'s, etc., i.e., $x_{li}^{o} = x_{li} + \epsilon_{x_{li}}$, etc. Similarly,
let $c_i^{o}$, $i = 1, 2, 3$ be the observed coefficients. It is easy to
check that $E(c_i^{o}) = c_i$ for $i = 1, 2, 3$. On the other hand,
checking the variances of the coefficients $c_i^{o}$, we have

$$\mathrm{Var}(c_1^{o}) = \frac{\sigma^2}{n} \to 0 \quad \text{as } n \to \infty$$

$$\mathrm{Var}(c_2^{o}) = \frac{\sigma^2}{n} \to 0 \quad \text{as } n \to \infty$$

and

$$\mathrm{Var}(c_3^{o}) = \frac{2\sigma^2}{d^2 n} \to 0 \quad \text{as } n \to \infty.$$

So, when random noise is added to the image points, the
coefficients $c_i^{o}$, $i = 1, 2, 3$ are consistent estimators of the
corresponding true coefficients.
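A small Monte Carlo experiment can make the variance claim for the third coefficient concrete. The sketch below is illustrative only: the true coordinates, the disparities, the parameter values and the function names are assumptions, and the check is statistical (a fixed seed and a loose tolerance), not a proof.

```python
import random
import statistics


def c3(left_x, right_x, d):
    """Third coefficient: mean left-right difference of x, scaled by 1/d."""
    return (sum(left_x) - sum(right_x)) / (d * len(left_x))


def c3_statistics(n=100, sigma=0.01, d=0.5, trials=2000, seed=0):
    """Return (true c3, sample mean, sample variance) of the observed c3
    when independent N(0, sigma^2) noise is added to every x coordinate."""
    rng = random.Random(seed)
    true_left = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Arbitrary positive disparities, chosen only for the illustration.
    true_right = [x - rng.uniform(0.02, 0.05) for x in true_left]
    samples = []
    for _ in range(trials):
        obs_left = [x + rng.gauss(0.0, sigma) for x in true_left]
        obs_right = [x + rng.gauss(0.0, sigma) for x in true_right]
        samples.append(c3(obs_left, obs_right, d))
    return (c3(true_left, true_right, d),
            statistics.mean(samples),
            statistics.pvariance(samples))


true_c3, mean_c3, var_c3 = c3_statistics()
predicted_var = 2 * 0.01 ** 2 / (0.5 ** 2 * 100)  # 2 sigma^2 / (d^2 n)
```

The sample mean should sit near the true coefficient and the sample variance near the predicted value, which shrinks like 1/n as more contour points are used.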


6.9.2. Error due to the addition and deletion of random noise

Here we analyze the effect of drop-ins and drop-outs.
More specifically, there will be points in the left image frame
whose corresponding ones won't be found by the interest
point operator in the right image frame, and vice versa.
Because this effect is very hard to capture quantitatively and
it will depend on the particular “point extraction” operator,
we do our analysis by assuming that this problem is realized
by adding random points to the different frames. Let us
analyze the effect of adding $m_1$ and $m_2$ noise points to the
left and right images respectively. We assume that these
points are distributed randomly (uniformly) over the image
frames, i.e., if $(x_{li}^{n}, y_{li}^{n})$ denotes the $i$th noise point in the
left frame, then

$$x_{li}^{n} \sim U(-k, k), \qquad y_{li}^{n} \sim U(-k, k)$$

and they are independent, where the image is represented in
the box (Fig. 6.3.1).

Figure 6.3.1: The image box, with corners at $(\pm k, \pm k)$.

The same symbolism is assumed for the noise points in
the right image frame and the noise is assumed to be
independent. Let $c_1^{*}$, $c_2^{*}$ and $c_3^{*}$ be estimators of the
coefficients that were defined in the previous sections after
adding noise points. We further assume (for simplicity of the
calculations and without loss of generality) that $m_1 = m_2 = m$.
Then we have

$$c_1^{*} = \frac{1}{n + m} \left[ \sum_{i=1}^{n} x_{li} + \sum_{i=1}^{m} x_{li}^{n} \right], \qquad c_2^{*} = \frac{1}{n + m} \left[ \sum_{i=1}^{n} y_{li} + \sum_{i=1}^{m} y_{li}^{n} \right],$$

$$c_3^{*} = \frac{1}{d(n + m)} \left[ \sum_{i=1}^{n} x_{li} + \sum_{i=1}^{m} x_{li}^{n} - \sum_{i=1}^{n} x_{ri} - \sum_{i=1}^{m} x_{ri}^{n} \right].$$

Since the mean of the noise points is assumed to be zero, i.e.
$E(x_{li}^{n}) = 0$, etc., we have

$$E(c_1^{*}) = \frac{n}{n + m}\, c_1, \quad E(c_2^{*}) = \frac{n}{n + m}\, c_2 \quad \text{and} \quad E(c_3^{*}) = \frac{n}{n + m}\, c_3.$$

So, $E(c_i^{*}) = \frac{n}{n + m}\, c_i$, $i = 1, 2, 3$.

Thus, unlike the perturbation noise, throwing in random
points makes the estimates $c_i^{*}$ underestimate $c_i$ by a factor of
$\frac{n}{n + m}$ for $i = 1, 2, 3$.

Let us now examine the consistency of the estimates.
We have

$$\mathrm{Var}(x_{li}^{n}) = \frac{1}{2k} \int_{-k}^{k} x^2\, dx = \frac{1}{6k} \left[ x^3 \right]_{-k}^{k} = \frac{k^2}{3}$$

$$\mathrm{Var}(c_3^{*}) = \frac{1}{d^2 (m + n)^2}\, 2m\, \frac{k^2}{3} = \frac{2m}{3 d^2 (m + n)^2}\, k^2$$

This is enough to convince us that, unlike random
perturbation of points, throwing in points at random makes the
estimators based on the image points inconsistent. However,
from actual experiments we observed that this
kind of error hardly affected the results when the difference
in the number of points in the left and right image frame was
about 20% of the total number of points.

6.9.3. Error due to numerical instabilities

The fact that we are left with the solution of a linear
system for our correspondenceless stereo does not mean that
the solution will be robust. The matrix might be ill-conditioned
or critically ill-conditioned. The analysis is the
same as in Section 5.5 and is omitted here. Experiments
showed robustness, but a theoretical classification of pathological
cases (point distributions that create critical ill-conditioning)
is still needed.
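To make the concern concrete, the sketch below solves two 2x2 linear systems and perturbs the right-hand side slightly. The matrices are hypothetical and are not derived from any point distribution in our experiments; they only illustrate how a nearly singular matrix amplifies a small measurement error.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return x0, x1

def sensitivity(a, b, b_perturbed):
    """Largest change in a solution component caused by perturbing b."""
    x = solve_2x2(a, b)
    y = solve_2x2(a, b_perturbed)
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

well = [[2.0, 1.0], [1.0, 3.0]]        # well-conditioned example
ill = [[1.0, 1.0], [1.0, 1.0001]]      # rows nearly parallel: ill-conditioned
b = [2.0, 2.0]
b_perturbed = [2.0, 2.001]             # error of 0.001 in one measurement

print(sensitivity(well, b, b_perturbed))
print(sensitivity(ill, b, b_perturbed))
```

The same perturbation moves the solution of the first system by less than one thousandth, while the solution of the second moves by about ten units.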

6.9.4.  General  stability  comments 

We think that the level of robustness or stability is an
essential part of the Marr paradigm (one that Marr left implicit in
his writings). In this paper we used some simple statistical
analysis to demonstrate the behavior of our algorithms in the
presence of noise. One can resort to worst case or average
error analysis, assuming some probability distribution of the
input. A nice methodology for computing the effects of
discretization error is presented in [Kamgar-Parsi and
Kamgar-Parsi, 1987] and can be applied here as well. We
think that research on robustness should go even further.
Some problems (such as structure from motion) are quite
unstable (as observed) for all algorithms that have been proposed.
Maybe, then, the problem itself is unstable by its
nature and should be formulated differently. For this to happen,
one would have to prove that the problem is unstable,
regardless of the algorithm used. In a sense, one would have
to consider the space of all problems, define a probability
distribution in the space of all problems and then compute the
probability that a problem is
unstable, where instability is expressed as the distance from a
set of ill-posed problems. Such an approach has been initiated
in numerical analysis [Demmel, 1987].


6.10.  Experiments 

We  experimented  with  synthetic  and  natural  images. 
Again  the  acquisition  of  natural  images  was  done  through  a 
solid  state  Panasonic  camera  and  the  motion  control  through 
a  Merlin  American  Robot  arm. 

6.10.1.  Synthetic  images 

The  following  experiments  implement  the  algorithms  of 
Section  6.4. 

Figure  6.4  shows  the  projections  of  a  set  of  planar  points 
on  both  the  left  and  right  frames.  The  frame  on  top  is  the 
superposition of the left and right frames. The actual parameters
of the plane were p = 0.0, q = 0.0, c = 10000, while the
number of points was equal to 1000. We did not include any
noise in our pictures. The computed ones were P = -0.0,
Q = -0.0, C = 10000.0.

Figure  6.5  shows  the  projections  of  a  set  of  planar  points 
on  both  the  left  and  right  frames.  The  frame  on  top  is  the 
superposition of the left and right frames. The actual parameters
of the plane were p = 1.0, q = 1.0, c = 10000, while
the number of points was equal to 1000. We did not include
any noise in our pictures. The computed ones were P = 0.98,
Q = 1.00, C = 9809.8.

Figure  6.6  shows  the  projections  of  a  set  of  planar  points 
on  both  the  left  and  right  frames.  The  frame  on  top  is  the 
superposition of the left and right frames. The actual parameters
of the plane were p = 1.0, q = 1.0, c = 10000, while the
number of points was equal to 1000. We included 5% noise in
the left frame and 7% in the right one. By noise here we
mean putting points at random in order to capture the effect
of the imperfection of interesting point extraction operations.
In actual real images, we would first have to extract interesting
points. The computed ones were P = 1.7, Q = 1.2,
C = 10266.7.

The  next  experiments  test  the  algorithms  in  Section  6.6 
for  determining  direction  of  translation  for  a  set  of  planar 
points  as  well  as  for  a  planar  contour,  and  the  algorithms  for 
determining  unrestricted  3-D  motion. 

Figures 6.7, 6.8 and 6.9 implement the algorithm of equations
(6.23). The frames at the bottom represent the projections
of a set of 3-D planar points on the left and right eyes
respectively. The two frames at the top represent the projections
of the same set of points after it has been translated.
The actual direction of translation, the computed one, the
number of points and the noise are shown in the respective
figures. The rest of the figures show experimental results for
computation of shape, direction of translation and unrestricted
3-D motion from the algorithms in Sections 6.5, 6.6 and
6.7.

Figure 6.10 shows the images of a translating planar contour
(human figure) taken by a binocular system at two
different time instants. The actual orientation of the contour
in space was (p,q) = (10,5) and the actual direction of translation
(dx/dz, dy/dz) = (4,6). Our program recovered orientation
(p,q) = (10.00007, 5.000297) and direction of translation
(dx/dz, dy/dz) = (4.000309, 6.00463). Figure 6.11
shows again the perspective images of a translating planar
contour taken by a binocular system at two different time
instants. The actual orientation of the contour was
(p,q) = (-25,30) and the direction of translation (dx/dz,
dy/dz) = (50,60). The computed orientation from these
images was (p,q) = (-24.99, 30.000021) and the computed
direction of translation (dx/dz, dy/dz) = (49.858421,
59.830266). Figure 6.12 shows the perspective images of a
translating planar contour taken by a binocular system at two
different times. The actual orientation of the contour was
(p,q) = (10, -11) and the direction of translation (dx/dz,
dy/dz) = (1.66, 3.33). The estimated parameters from these
images were (p,q) = (9.99, -11.000383) and (dx/dz,
dy/dz) = (1.66, 3.33).

Figure 6.13: Tenth experiment.
Figure 6.15: Twelfth experiment.
Figure 6.17: Fourteenth experiment.

The experiments to determine the general motion parameters
are shown in Figures 6.13-6.17. The actual and computed
parameters are recalculated with respect to the coordinate
system of the left camera. In Figure 6.13 the actual
translation was (100, 100, 100) and actual rotation was 0.2
radians around the axis (0.707, 0.707, 0); the estimated values
were translation = (100.4, 99.6, 99.8) and rotation = 0.1997
radians around the axis (0.707, 0.707, 0). The results for the
next figure were as follows: actual translation (50, 60, 10) and
actual rotation = 0.2 radians around the axis (0.707, 0,
0.707). The estimated translation was (14.25, 54.94, 39.53)
and the estimated rotation was 0.1980 radians around the axis
(0.704, 0.014, 0.711). Figure 6.15 shows the actual translation
as (100, 150, 100) and rotation of 0.9 radians around the axis
(0.123, 0.123, 0.985); the estimates were translation =
(106.11, 150.7, 99.21) and rotation = 0.902 radians around
the axis (0.121, 0.119, 0.985). The ship in Figure 6.16 was
translated by (100, 150, 80) and rotated by 1.5 radians around
the axis (0.124, 0, 0.992); the recovered parameters were
translation = (95.30, 145.98, 80.01) and rotation = 1.49 radians
around the axis (0.124, 0, 0.992). Figure 6.17 shows the
actual parameters as translation = (100, 50, 60) and rotation
= 0.2 radians around the axis (0.577, 0.577, 0.577). The
estimated parameters were translation = (102.75, 49, 59.19)
and rotation = 0.199 radians around the axis (0.577, 0.573,
0.582).

Note: All the parameters involved in the above experiments
that have a dimension of length ($L^1 M^0 T^0$) are calculated in
pixels, where 1 pixel = 100 μm.


6.10.2.  Natural  images 

Here  we  describe  an  experiment  where  we  used  the  algo¬ 
rithm  of  Section  6.4  to  compute  the  orientation  of  a  planar 
block.  We  measure  again  the  change  in  shape,  not  the  actual 
shape,  in  order  to  avoid  ambiguities  in  ground  truth  measure¬ 
ment.  Figures  6.18  and  6.19  show  the  left  and  right  images  of 
a block. Figures 6.20 and 6.21 show again the left and right
images of the block after the camera has been rotated considerably
in order to change the orientation of the block. We measure
the two surface normals, before and after the rotation, and
then we compute the angle between the normals, whose
ground truth we know from the rotation of the camera. In
this experiment the ground truth for the angle between the
surface normals was $\theta = 30°$ and the estimated one was
$\hat{\theta} = 33.103°$. We also carried out experiments for the
computation of motion (only direction of translation, because
unrestricted 3-D motion requires computation of depth, whose
ground truth we could not estimate). The accuracy here was
better than 5 percent in all cases.
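The measurement used in this experiment, the angle between two surface normals, can be sketched as follows. The sketch assumes each orientation is given as a gradient (p, q) with normal (p, q, -1); the two gradients below are hypothetical values chosen to be exactly 30° apart, not the ones measured from Figures 6.18-6.21. Only the reported angles 30° and 33.103° are taken from the text.

```python
import math

def normal_from_gradient(p, q):
    """Surface normal (p, q, -1) for a surface with gradient (p, q)."""
    return (p, q, -1.0)

def angle_between_deg(n1, n2):
    """Angle in degrees between two 3-D vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm1 = math.sqrt(sum(a * a for a in n1))
    norm2 = math.sqrt(sum(b * b for b in n2))
    cos_angle = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cos_angle))

# Hypothetical orientations before and after a 30 degree rotation about the Y axis.
before = normal_from_gradient(0.0, 0.0)
after = normal_from_gradient(math.tan(math.radians(30.0)), 0.0)
angle = angle_between_deg(before, after)

# Relative error of the reported estimate against the reported ground truth.
relative_error = abs(33.103 - 30.0) / 30.0
print(angle, relative_error)
```

For the reported angles the relative error is roughly 10 percent, which is noticeably larger than the error reported for the direction of translation.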

7.  COMBINING  INFORMATION 
FROM  OTHER  CUES 

We  talked  about  combining  some  cues  to  achieve  unique 
and  robust  solutions  to  the  computation  of  some  intrinsic 
images.  Some  researchers  have  combined  cues  in  the  past  to 
compute  3-D  properties,  such  as  Waxman  and  Duncan  (bino¬ 
cular  image  flows:  stereo  and  retinal  motion),  Richards 
(stereo  and  motion)  and  Grimson  (shading  and  stereo).  The 
analysis  here  is  by  no  means  complete.  More  cues  such  as 
color,  discontinuities,  nonplanar  contours,  can  and  should  be 
combined  with  other  sources  of  information.  We  hope  that 
this  work  will  not  only  serve  as  a  source  of  technical  results 
but  also  as  a  general  methodology  in  computer  vision  prob¬ 
lems.  By  trying  to  solve  problems  in  a  correspondenceless 
way  we  do  not  mean  that  correspondence  is  useless.  On  the 
contrary, correspondence is a very basic process [Ullman,
1979], but it is a very hard problem to solve. We also hope
that  this  research  might  generate  some  research  on  attacking 
the  hard  correspondence  problem  that  is  the  basis  of  many 
visual  processes. 

8.  CONCLUSIONS  AND  FUTURE  DIRECTIONS 

In  this  paper  we  claim  that  low-level  vision  computations 
should  be  done  in  such  a  way  that  uniqueness  and  robustness 
of  the  computations  is  guaranteed  and  that  visual  computa¬ 
tions  can  be  done  in  this  way.  We  justified  our  claims  by  exa¬ 
mining  several  problems,  such  as  shape  from  texture,  shape 
from  shading,  visual  motion  analysis,  shape  and  motion  from 
contour  and  some  cases  of  stereo. 

The  problem  of  understanding  vision  and  building  intel¬ 
ligent  machines  with  a  visual  sense  is  very  hard  and  by  no 
means  solved.  We  have  argued  that  a  very  large  part  of 
today’s  research  is  analyzing  visual  capabilities,  i.e.  research  is 
concentrating  on  topics  that  correspond  to  identifiable 
modules  in  the  human  visual  system.  And  even  though  it  is 
not  at  all  clear  what  the  topics  are  that  correspond  to 
identifiable  modules  in  the  human  visual  system,  research  has 
shown  that  shading,  texture,  motion,  contours  and  stereo  are 
cues  that  help  to  understand  the  extrapersonal  space. 

There is no doubt that vision is full of redundancy and
there is a lot of information in the image which, if used
correctly, will give rise to constraints which will guarantee
uniqueness and robustness of the visual computations. We
have  demonstrated  this  for  the  case  of  the  problems  that 
appeared  in  Sections  4,  5  and  6.  Obviously,  we  need  to 
obtain  robust  and  unique  visual  computations  if  we  ever  want 
to  advance  our  understanding  of  vision. 

There is a standard way to design large and complex
information systems, as research in computational fields has
shown [Feldman, 1985].

(1)  First  we  divide  the  system  into  functional  components 
which  break  the  overall  task  into  autonomous  parts,  and 
analyze  these  components  (Fig  2.3). 

(2)  Then  we  must  choose  the  representation  of  information 
within  the  subsystems  and  the  language  of  communica¬ 
tion  among  them. 

(3)  After  this,  the  details  of  the  systems  are  tested  individu¬ 
ally,  in  pairs  and  all  together. 

In  this  paper,  in  the  course  of  our  analysis  of  a  visual 
system  (machine  or  biological),  we  started  with  the  first  two 
steps  and  a  part  of  the  third,  and  we  did  this  for  some  sub¬ 
systems  (texture,  shading,  motion,  contours,  stereo).  Our 
results  can  be  summarized  best  in  Figure  3.2. 

There  are  more  subsystems  to  be  analyzed  such  as  color, 
nonplanar  contours,  recognition  of  objects,  navigation 
modules,  and  many  others.  The  analysis  of  all  of  these  consti¬ 
tutes  our  future  research,  as  well  as  the  research  of  several 
others.  More  importantly,  our  immediate  future  research  will 
be  devoted  to  the  third  step,  where  we  have  to  test  the  sub¬ 
systems  all  together.  Finally,  the  robustness  of  the  intro¬ 
duced  algorithms  should  be  established  in  a  unified  theoretical 
way.  In  the  present  work,  our  robustness  justification  was 
largely  based  on  the  fact  that  in  most  of  the  cases  redundancy 
did  the  job.  This  paper,  besides  the  technical  results,  has 
attempted  to  establish  a  paradigm  in  which  vision  problems 
should  be  addressed.  Always  keeping  in  mind  the  Marr  para¬ 
digm  [1981],  we  proposed  that  information  should  be  com¬ 
bined,  from  different  sources  when  possible,  in  order  to 
achieve  uniqueness  and  robustness  of  visual  computations  and 
at  the  same  time  making  the  minimal  number  of  assumptions. 
The mathematical vocabulary for combining information from
different sources, which is deterministic, is that of discontinuous
regularization, one of our current research topics.

A.1.1. Perspective projection

Consider an ideal pinhole at a fixed distance in front of
an image plane (see Figure A.1). Let us assume that an enclosure
is provided so that only light coming through the pinhole
can reach the image plane. Given that light travels along
straight lines, each point in the image corresponds to a particular
direction defined by a ray from that point through the
pinhole. This is what we know as perspective projection. In
the sequel, in order to simplify the resulting equations, we
consider the nodal point of the eye (pinhole) to be behind the image
plane. This is only for simplifying the analysis; all the results
can be transformed automatically to the actual case. The system
we will be using is depicted in Figure A.1. We define the
optical axis in this case to be the perpendicular from the
pinhole to the image plane. We introduce a Cartesian coordinate
system with the origin at the nodal point and the Z-axis
aligned with the optical axis and pointing toward the image
(Figure A.2). We would like to compute where the image A'
of the point A on some object in front of the camera will
appear. We assume that nothing lies on the ray from point
A to the nodal point O. Let V = (X,Y,Z) be the vector connecting
O to A, and V' = (x,y,f) the vector connecting O to
A', with f the focal length, i.e., the distance of the image
plane from the nodal point O, and (x,y) the coordinates of
the point A' on the image plane in the naturally induced
coordinate system with origin at the point of intersection of
the image plane with the optical axis, and the axes x and y
parallel to the axes OX and OY of the camera coordinate system.
It is trivial to see that

$$x = \frac{fX}{Z}, \qquad y = \frac{fY}{Z} \qquad \text{(A.1)}$$

Equations (A.1) relate the image coordinates to the world
coordinates of a point. Very often, to further simplify the
equations we assume f = 1 without loss of generality.
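As a small illustration of the perspective relation x = fX/Z, y = fY/Z, the following sketch projects a world point onto the image plane. The function name `perspective_project` and the sample points are ours and purely illustrative.

```python
def perspective_project(point, f=1.0):
    """Image coordinates (x, y) of the world point (X, Y, Z) under perspective."""
    X, Y, Z = point
    if Z == 0:
        raise ValueError("a point with Z = 0 has no perspective image")
    return (f * X / Z, f * Y / Z)

# A point and a second point twice as far along the same ray through O.
image_a = perspective_project((2.0, 4.0, 8.0), f=2.0)
image_b = perspective_project((4.0, 8.0, 16.0), f=2.0)
print(image_a, image_b)
```

Both points lie on the same ray through the nodal point and therefore have the same image, which is the reason why depth cannot be read off a single perspective image.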

Figure A.1: Perspective projection.

A.1.2. Orthographic projection

The orthographic projection model seems unrealistic to
the eye of the beginner, and so we will motivate its use. If, in
the perspective projection model, we have a plane that lies
parallel to the image plane at $Z = Z_0$, then we define as
magnification, $\mathit{mg}$, the ratio of the distance between two
points measured in the image to the distance between the
corresponding points on the plane. So, if we have a small
interval $(dX, dY, 0)$ on the plane and the corresponding
small interval $(dx, dy, 0)$ in the image, then

$$\mathit{mg} = \frac{\sqrt{dx^2 + dy^2}}{\sqrt{dX^2 + dY^2}} = \frac{f}{Z_0}$$

So a small object at an average distance $Z_0$ will produce an
image that is magnified by $\mathit{mg}$. It is obvious that the
magnification is approximately constant when the depth range
of the scene is small relative to the average distance of the
surfaces from the camera. In this case we can simply write
for the projection (perspective) equations

$$x = mX \quad \text{and} \quad y = mY \qquad \text{(A.2)}$$

with $m = f/Z_0$ and $Z_0$ the average value of the depth $Z$.
For our convenience, we can set $m = 1$. Then equations (A.2)
are further simplified to the form:

$$x = X \quad \text{and} \quad y = Y \qquad \text{(A.3)}$$

Equations (A.3) describe the orthographic projection
model, where the rays are parallel to the optical axis (see Figure
A.3). So, the difference between orthography and perspective
is small when the distance to the scene is much larger
than the variation in distance among objects in the scene. A
rough rule of thumb is that perspective effects are significant
when a wide angle lens is used, while images taken by
telephoto lenses tend to approximate orthographic projection,
but of course this is not exact [Horn, 1986].
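The claim that the constant-magnification model x = mX, y = mY with m = f/Z_0 is close to perspective only when the depth range is small can be checked with a short sketch. The two point sets below are hypothetical: one is shallow relative to its distance, the other is not.

```python
import math

def perspective(point, f=1.0):
    """Perspective image of (X, Y, Z): x = f X / Z, y = f Y / Z."""
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)

def scaled_orthographic(point, z0, f=1.0):
    """Constant magnification m = f / z0 applied to (X, Y); depth is ignored."""
    X, Y, _ = point
    m = f / z0
    return (m * X, m * Y)

def max_error(points, f=1.0):
    """Largest image distance between the two models over a set of points."""
    z0 = sum(p[2] for p in points) / len(points)    # average depth Z_0
    worst = 0.0
    for p in points:
        xp, yp = perspective(p, f)
        xo, yo = scaled_orthographic(p, z0, f)
        worst = max(worst, math.hypot(xp - xo, yp - yo))
    return worst

shallow = [(1.0, 1.0, 100.0), (1.0, -1.0, 101.0), (-1.0, 1.0, 99.0)]
deep = [(1.0, 1.0, 2.0), (1.0, -1.0, 6.0), (-1.0, 1.0, 4.0)]
print(max_error(shallow), max_error(deep))
```

For the shallow set the two models differ by well under a thousandth of an image unit, while for the deep set the discrepancy is a sizable fraction of the image coordinates themselves.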


A.1.3. Paraperspective projection

The orthographic projection is a very rough approximation
of the projection of light on the fovea, but it seems
unrealistic for machine vision applications at this point. The
perspective projection, a true model, sometimes produces very
complicated equations for most of the problems and makes
the subsequent analysis very hard. The paraperspective
projection is a very good approximation of the perspective,
and stands between orthography and perspective. A very
similar form of the paraperspective projection was first introduced
by Ohta et al. [Ohta et al., 1983]. Let a coordinate system
OXYZ be fixed with respect to the camera, with the -Z
axis pointing along the optical axis and O the nodal point of
the eye. Again we consider the image plane perpendicular to
the Z axis at the point (0,0,-1) (i.e., focal length f = 1,
without loss of generality). Consider a small planar surface
patch SP on a surface S, with the planar patch obeying the
equation -Z = pX + qY + C (see Figure A.1). Under perspective,
any point (X,Y,Z) ∈ SP is projected onto the point
(X/Z, Y/Z) on the image plane. Let us now see how the
small patch SP is projected under the paraperspective projection
model.

Figure A.3: Orthographic projection.

Figure A.4: Paraperspective projection.

Consider the plane -Z = d, where -d is the Z-coordinate
of the center of mass of the region SP. The paraperspective
projection is realized by the following two steps:

a) First, the small region SP is projected onto the plane
-Z = d, which is parallel to the image plane and
includes the center of mass of the region SP. The projection
is performed by using rays that are parallel
to the central projecting ray OG, where G is the center
of mass of the region SP.

b) The image on the plane -Z = d is now projected perspectively
onto the image plane. Since the plane
-Z = d is parallel to the image plane, the transformation
is a reduction by a scaling factor 1/d (see Figure
A.5, which illustrates a cross sectional view of the projection
process sliced by a plane which includes the central
projecting ray and is perpendicular to the XZ plane).

Finally, it is clear that the introduced model decomposes
the image distortions into two parts: step (a) captures the
foreshortening distortion and part of the position effect,
and step (b) captures both the distance and the position
effects.

The  paraperspective  projection  process  turns  out  to  have 
nice  mathematical  properties,  since  it  is  an  affine 
transformation. 
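The two steps of the paraperspective projection can be sketched in a few lines. To keep the signs simple, the sketch assumes a frame in which depth Z is positive along the optical axis and the image plane is at unit distance, which differs from the -Z convention used in the text; the patch coordinates are hypothetical.

```python
def centroid(points):
    """Center of mass G of a list of 3-D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def perspective_project(point):
    """Perspective image (X/Z, Y/Z) with unit focal length."""
    X, Y, Z = point
    return (X / Z, Y / Z)

def paraperspective_project(point, g):
    """Two-step paraperspective image of a point, given the patch centroid g."""
    X, Y, Z = point
    gx, gy, d = g
    # Step (a): slide the point along the direction OG until it reaches the plane Z = d.
    t = (d - Z) / d
    xa, ya = X + t * gx, Y + t * gy
    # Step (b): perspective projection of the plane Z = d is a scaling by 1/d.
    return (xa / d, ya / d)

patch = [(1.0, 2.0, 10.0), (1.2, 2.0, 10.2), (1.0, 2.2, 9.8), (1.2, 2.2, 10.0)]
g = centroid(patch)
for p in patch:
    print(perspective_project(p), paraperspective_project(p, g))
```

Each output coordinate is a linear function of (X, Y, Z) plus a constant, which is the affine property mentioned above; for this small patch the paraperspective and perspective images agree to about three decimal places, and they coincide exactly at the center of mass.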

After  having  discussed  the  geometric  correspondence 
between  points  in  the  image  and  points  in  the  scene,  we  now 
need  to  determine  the  brightness  at  each  image  point.  But  to 
do  that  we  need  some  technical  prerequisites,  which  will  be 
found  in  the  next  section  on  intrinsic  images. 

A.2.  Intrinsic  Images 

In  the  previous  sections  we  stressed  the  fact  that  a  very 
large  percentage  of  modern  computer  vision  is  exploiting  the 
recovery  of  three-dimensional  properties  (i.e.  intrinsic  images) 
from  two-dimensional  image  properties.  This  section  will 
define  mathematically  what  we  mean  by  intrinsic  images,  i.e. 
shape,  motion,  depth,  etc. 

Consider again a coordinate system OXYZ, fixed with
respect to a camera whose nodal point is the origin O and
whose image plane is perpendicular to the Z-axis (which is also the
optical axis), with focal length f. Consider also the naturally
induced image plane xy coordinate system, with origin at the
point where the optical axis intersects the image plane and
x, y axes parallel to OX and OY respectively. Image coordinates
will be denoted by small letters and world coordinates
by capital letters. Suppose that the system is imaging a surface
S with equation Z = Z(X,Y).

A.2.1. What we mean by shape

We will examine shape under both orthography and perspective
projection. Surface orientation is usually represented
as the surface normal vector. In intrinsic images, shape
means the local surface orientation, not some global property
of the surface. If the surface is expressed as Z(X,Y), it can be
reconstructed from the local shape orientation.

Figure A.5: Cross sectional view of paraperspective.

The  meaning  of  shape  under  perspective 

Consider a point (X,Y,Z) ∈ S whose image under perspective
projection is the point (x = fX/Z, y = fY/Z). If we
say that we know the shape of the object in view at the point
(x,y), we mean that we know the surface normal vector n of
surface S at the point (X,Y,Z), in particular

$$\mathbf{n} = \left(\frac{\partial Z}{\partial X}, \frac{\partial Z}{\partial Y}, -1\right)$$

Suppose now that for every point (x,y) in the image we know
the surface normal of the surface patch whose image is the
point (x,y). Then, this new image (a surface normal for each
point (x,y) of the image) is called the intrinsic shape image.
But from only one image we can never hope to compute the
exact (X,Y,Z) point, and from it $(\partial Z/\partial X, \partial Z/\partial Y, -1)$.
What we can compute, though, is the quantity $(\partial Z/\partial x,
\partial Z/\partial y)$, i.e., the gradient of the surface expressed in retinal
coordinates. But then, what is the relationship between the
gradient in retinal and world coordinates, or in other words,
what do we know when we know the quantities $(\partial Z/\partial x,
\partial Z/\partial y)$?

Consider a point (x,y) on the image and a small displacement
in the image (dx, dy) from the point, which corresponds
to a displacement (dX, dY, dZ) in the world, on the surface
Z = Z(X,Y). Then, from the perspective projection equations,
we have

$$dX = \frac{Z\,dx + x\,dZ}{f} \quad \text{and} \quad dY = \frac{Z\,dy + y\,dZ}{f}$$

Now, given that $Z(X + dX, Y + dY) = Z(x + dx, y + dy)$,
and expanding both sides of this equation in a Taylor series
and ignoring the higher order terms, we get

$$\frac{dZ}{Z} = \frac{\partial Z/\partial X}{f - x\,\partial Z/\partial X - y\,\partial Z/\partial Y}\,dx + \frac{\partial Z/\partial Y}{f - x\,\partial Z/\partial X - y\,\partial Z/\partial Y}\,dy$$

From the above equations it is easy to see that if $\partial Z/\partial X$,
$\partial Z/\partial Y$ are known, then the quantity

$$\frac{Z(x + dx, y + dy)}{Z(x,y)}$$

is computable. But this means that if the surface normals are
known indexed by retinal coordinates, then the depth function
$Z(x,y)$ can be computed up to a constant factor. In
other words, if shape is known, then for any two points
$(x_1,y_1)$ and $(x_2,y_2)$ on the image, we know the ratio

$$\frac{Z(x_1,y_1)}{Z(x_2,y_2)}.$$

So, an object whose shape we know under perspective projection
can be small and near the camera or large and far away.
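The scale ambiguity can be seen on the simplest possible surface. The sketch below assumes a plane Z = pX + qY + c viewed under x = fX/Z, y = fY/Z, for which substitution gives Z(x,y) = cf/(f - px - qy); the function names and numerical values are illustrative only.

```python
def plane_depth(x, y, p, q, c, f=1.0):
    """Depth at image point (x, y) of the plane Z = p X + q Y + c."""
    return c * f / (f - p * x - q * y)

def relative_depth_step(x, y, dx, dy, p, q, f=1.0):
    """First-order dZ/Z for an image step (dx, dy), using the gradient (p, q) only."""
    return (p * dx + q * dy) / (f - p * x - q * y)

p, q = 0.3, -0.2

# The depth ratio between two image points does not depend on the offset c.
near = plane_depth(0.1, 0.05, p, q, c=10.0) / plane_depth(-0.2, 0.3, p, q, c=10.0)
far = plane_depth(0.1, 0.05, p, q, c=1000.0) / plane_depth(-0.2, 0.3, p, q, c=1000.0)

# The relative change of depth over a small image step follows from (p, q) alone.
dx, dy = 1e-6, 2e-6
exact = plane_depth(0.1 + dx, 0.05 + dy, p, q, c=10.0) / plane_depth(0.1, 0.05, p, q, c=10.0) - 1.0
first_order = relative_depth_step(0.1, 0.05, dx, dy, p, q)
print(near, far, exact, first_order)
```

The two ratios coincide although one plane is a hundred times farther away, so the gradient fixes depth only up to a constant factor.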

The  meaning  of  shape  under  orthography 

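Under orthographic projection, the image coordinates of
a point are equal to the corresponding 3-D coordinates, i.e.,
(x,y) = (X,Y). So

$$\left(\frac{\partial Z}{\partial x}, \frac{\partial Z}{\partial y}\right) = \left(\frac{\partial Z}{\partial X}, \frac{\partial Z}{\partial Y}\right)$$

Obviously, if we know shape in this case, since

$$Z(x + dx, y + dy) - Z(x,y) = \frac{\partial Z}{\partial x}\,dx + \frac{\partial Z}{\partial y}\,dy + (\text{h.o.t.}),$$

we know that the depth function can be computed up to a
constant additive term. So, if we know shape under orthography,
we know exactly the object in view, but we do not know
its depth.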

Other  representations  for  shape 

We have stated that the surface normal

$$\left(\frac{\partial Z}{\partial X}, \frac{\partial Z}{\partial Y}, -1\right),$$

with $p = \partial Z/\partial X$, $q = \partial Z/\partial Y$ at a point of a surface
$Z = Z(X,Y)$, represents the shape. This
is not the only representation. Obviously shape is nothing
but a direction in three-dimensional space, and so there are
many representations for it. The ones that we will use quite
often in this paper are, with the exception of the gradient
that we have already analyzed, the following:

a) Coordinates (a,b,c) on the Gaussian sphere.

b) Latitude and longitude angles, say, (θ, φ).

c) Slant and tilt. Slant is the tangent of the latitude angle
and tilt is the longitude angle. The notation for (slant,
tilt) is (σ, τ). The slant and tilt are polar versions of the
(p,q) coordinates.

The relationship among these different representations is
given by the following equations:

$$\sigma = \tan\theta = \sqrt{p^2 + q^2}$$

$$p/q = \tan\varphi = \tan\tau$$

Finally, if (a,b,c) are the coordinates on the Gaussian sphere,
then

$$(a,b,c) = \left(\frac{p}{k}, \frac{q}{k}, \frac{-1}{k}\right) \quad \text{with} \quad k = \sqrt{p^2 + q^2 + 1}$$
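A short sketch of the conversions from the gradient (p, q) may help. It covers only the slant, the latitude angle and the Gaussian sphere coordinates; the sign of the third sphere coordinate is an assumption that follows the normal (p, q, -1), and the function names are ours.

```python
import math

def slant(p, q):
    """Slant, i.e. the tangent of the latitude angle: sqrt(p^2 + q^2)."""
    return math.hypot(p, q)

def latitude(p, q):
    """Latitude angle in radians, recovered from the slant."""
    return math.atan(slant(p, q))

def gaussian_sphere(p, q):
    """Unit vector along (p, q, -1); the sign of the last entry is an assumption."""
    k = math.sqrt(p * p + q * q + 1.0)
    return (p / k, q / k, -1.0 / k)

print(slant(3.0, 4.0), math.degrees(latitude(1.0, 0.0)), gaussian_sphere(3.0, 4.0))
```

A fronto-parallel patch (p = q = 0) has slant zero and maps to the pole of the sphere, while larger gradients move the point toward the equator.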


A.2.2.  What  we  mean  by  retinal  motion 

If the object in view is moving with a general motion, or
if the camera is moving, or if both move, then the image is
moving too. Let the retinal velocity at an image point be
(u,v). The resulting vector field (the velocity of every image
point) is called the retinal motion field or optic flow field.
This flow field is an intrinsic retinal motion image.

A.2.3.  What  we  mean  by  depth 

Consider again a surface S with equation Z = Z(X,Y)
in front of the camera. Every point (x,y) in the image is the
projection of a point (X,Y,Z) ∈ S. If for every point (x,y) on
the image we know the Z coordinate (depth) of the
corresponding 3-D point (X,Y,Z), then we know exactly where
the surface is with respect to the camera coordinate system.
The resulting image (to every point in the image there
corresponds a number, the depth of the corresponding 3-D point)
is called the intrinsic depth image.

A.2.4.  Intrinsic  parameters  that  are  not  images 

There exist intrinsic parameters which do not correspond
to every point in the image. These are global constants, and
every point in the image is in some relation to them. Examples
of these parameters are the 3-D motion and lighting
direction parameters.

3-D  motion  parameters 

If an object moves in front of a camera with a general
motion, then this motion can be considered as the sum of a
translation (U, V, W) and a rotation (A, B, C). These six
parameters will be called motion parameters.

Lighting  direction  parameters 

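Consider again a surface in front of a camera,
illuminated by a light source in the direction $(l_x, l_y, l_z)$ with
respect to the camera coordinate system. The direction
$(l_x, l_y, l_z)$ is called the lighting or illuminant direction.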


A.2.5. A synopsis

Up to this point we have defined the so-called intrinsic
parameters mathematically. These are shape, depth, retinal
motion, 3-D motion, and light source direction. This of
course does not mean that these are the only intrinsic
parameters. There can be many more, but the ones described
here are those that we (and contemporary research) consider
most important for the perception of the outside world.
Again, we do not want to get involved in philosophical
arguments about why these intrinsic parameters are important
to compute for visual perception. The shape of objects is
important for recognizing the objects that we see. The depth
of objects is important for our interaction with the
environment (picking up things). Retinal motion is important
for understanding discontinuities and segmenting the
environment, as well as for computing the 3-D motion, which
in turn is important for navigation, for understanding the
motion of objects in our environment, and for avoiding moving
objects.

There may very well be other important intrinsic parameters
that we haven't discovered yet. There may also be no
more intrinsic parameters of interest. Further research will
uncover the truth of this matter.


A.3.  Brightness  at  Every  Image  Point 

In this section we analyze how the brightness at every
image point is determined. The amount of light reflected by
a surface element depends on its microstructure, on its optical
properties and on the distribution and state of polarization of
the incident illumination. For several surfaces, the fraction of
incident illumination reflected in a particular direction
depends only on the surface orientation. The characteristics
of the reflectance of such a surface can be represented as a
function $f(i, g, e)$ of the angles $i$ = incident, $g$ = phase and
$e$ = emergent, as they are defined in Figure A.6.

The reflectance function $f(i, g, e)$ determines the ratio of
surface radiance to irradiance measured per unit surface area,
per unit solid angle, in the direction of the viewer. If we want
to be precise, we should specify the quantities and units used
to define the required ratio. Here it is sufficient to point out
the role that surface orientation plays in the determination of
the angles $i$ and $e$.


Consider the example of perfect specular (mirror-like)
reflection. In this case, the incident angle equals the emergent
angle and the incident, emergent and normal vectors lie in
the same plane ($g = i + e$). So, the reflectance function is

$$ f(i, e, g) = \begin{cases} 1, & \text{if } i = e \text{ and } i + e = g \\ 0, & \text{otherwise.} \end{cases} $$


The interaction of light with surfaces of varying roughness
and composition of material leads to a more complicated
distribution of reflected light. Surface reflectance characteristics
can be determined empirically, derived from models of surface
microstructure or derived from phenomenological models of
surface reflectance. The most widely used model of surface
reflectance is given by the function $f(i, e, g) = \rho \cos i$, where
$\rho$ is a constant depending on the specific surface. This
reflectance function corresponds to a phenomenological model
of a perfectly diffuse (Lambertian) surface which appears
equally bright from all viewing directions; the cosine of the
incident angle accounts for the foreshortening of the surface
as seen from the source.

Figure A.6: Reflectance model
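
The two reflectance models described in this section (perfect specular and perfectly diffuse) can be sketched as small functions of the angles $i$, $e$ and $g$, in radians. This is only an illustrative sketch: the function names, the numerical tolerance, and the clamping of the Lambertian term to zero for surface elements facing away from the source are assumptions, not part of the text.

```python
import math


def specular(i, e, g, tol=1e-9):
    """Perfect mirror: 1 when i == e and i + e == g, else 0.

    tol is a hypothetical tolerance for comparing floating-point angles.
    """
    return 1.0 if abs(i - e) <= tol and abs(i + e - g) <= tol else 0.0


def lambertian(i, rho):
    """Perfectly diffuse surface: rho * cos(i), independent of e and g.

    Clamping at zero (surface facing away from the source) is an assumption.
    """
    return rho * max(0.0, math.cos(i))


if __name__ == "__main__":
    print(specular(0.3, 0.3, 0.6))   # mirror configuration
    print(specular(0.3, 0.4, 0.7))   # not a mirror configuration
    print(lambertian(0.0, 0.8))      # source along the normal
```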

The surface normal vector relates surface geometry to
image irradiance because it determines the angles $i$ and $e$
appearing in the surface reflectance function $f(i, e, g)$. In
orthographic projection, the viewing direction, and so the
phase angle $g$, is constant for all surface elements. So, for a
fixed light source and viewer geometry and fixed material, the
ratio of scene radiance to scene irradiance depends only on
the surface normal vector. Furthermore, suppose that each
surface element receives the same irradiance. Then the scene
radiance, and hence image intensity, depends only on the
surface normal vector. A reflectance map $R(p, q)$ determines
image intensity as a function of $p$ and $q$ (where $(p, q, 1)$
is the surface normal vector). Using a reflectance map, an
image irradiance equation can be written as $I(x, y) = R(p, q)$,
where $I(x, y)$ is the intensity at the image point $(x, y)$ and
$R(p, q)$ is the corresponding reflectance map.

A reflectance map provides a uniform representation for
specifying the surface reflectance of a surface material for a
particular light source, object surface and viewer geometry. A
comprehensive survey of reflectance maps derived for a
variety of surface and light source conditions has been given
by Horn [Horn, 1977]. Furthermore, a unified approach to the
specification of surface reflectance maps has been given in
[Horn and Sjoberg, 1981].

Expressions for $\cos i$, $\cos e$ and $\cos g$ can be easily derived
from the surface normal vector $(p, q, 1)$, the light
source vector $(p_s, q_s, 1)$ and the vector $(0, 0, -1)$ which
points in the direction of the viewer. For a Lambertian [...]
the surface normal at the point whose image is the point $(x, y)$
and the light source direction respectively, under orthographic
projection. Under perspective projection, the model is not yet
known exactly.


Appendix  B


Proposition

Let a coordinate system $OXYZ$ be fixed with respect to
the left camera, with the $Z$ axis pointing along the optical
axis. Let the nodal point of the right camera be the point
$(\delta x, \delta y, 0)$ and its image plane be parallel to the previous
one, at $Z = 1$. Consider a region $\Omega$ in the world plane [...]
the perspective projection of $\Omega$ in the left and right cameras
respectively  and  let  A  A  5  be  their  respective  centroids  Then 
we  have 


50  I  -  pA  2  -  qB  2 
Proof:

If $(x', y')$ denote the coordinates on the surface plane and
$(x_i, y_i)$, $i = 1, 2$, the coordinates on image planes 1 and 2
respectively, and assuming that the foci of the two cameras
are at the points $(0, 0)$ and $(\delta x, \delta y)$ respectively with
coinciding image planes $z = -1$, we have
Abstract

This paper is motivated by the need to obtain the three
parameters of absolute 3-D motion in depth of objects in a dynamic
scene using visual information. Motion in depth (MID) parameters
refer to the following three components of the object or
camera motion: the single component of translation in depth
(parallel to the line of sight) and the two components of rotation
in depth (rotational components that are not about the line
of sight). In this paper we show that the use of stereoscopic
motion enables the MID parameters to be computed in a quick and
robust manner for a stereo camera system moving through an
environment that may have several independently moving rigid
bodies. It has been shown elsewhere that the use of temporal
binocular imagery permits the formulation of the ratio of the relative
optic flow between a stereo pair of images and the disparity
between them as a linear function of the image coordinates. The
coefficients of this linear function are the translation in depth
and rotation in depth components, which are precisely the MID
parameters being computed in our approach.


1.  Introduction 

Motivation 

The human visual system exhibits remarkable skill in detecting
the translation in depth of moving objects. Consider, for
instance, the following two commonly performed tasks: catching
a ball that is thrown toward the observer with a rotational spin
component contributing to the motion, and navigating around
obstacles moving toward or away from the observer. The
observer has to make rapid judgements about the translation in
depth of the object in question. Determination of the complete
trajectory of the object involves determination of all the
parameters of motion and is less important to the immediate needs of
the observer. It is possible to make such judgements of
translation in depth even when the object has other motion attributes,
such as rotations about different axes contributing to the actual
trajectory. Perception of the translation in depth of the spinning
ball moving toward the observer is one such example. Hence, we
believe that the translation in depth of objects is a particularly
useful motion parameter to compute for both navigation and
moving obstacle avoidance.

Also of significant interest to us is the psychophysical
evidence that demonstrates the perception of rotation in depth by
human observers. There are specific empirical findings [9,17,19]
that are concerned with the ability of human observers to
perceive three-dimensional relationships on the basis of rotation in
depth. The emphasis in the above work has been on both the
perception of the direction of rotation as well as on the overall
impression of depth or coherence elicited by the rotating objects.
Computational models dealing with the structural information
that can be extracted from such an image transformation have
been developed [28,29]. We believe that the perception of
rotation in depth of objects in the image can be used to guide further
semantic analysis of the scene in order to yield structural as well
as motion descriptions of semantically meaningful objects in the
scene.

Motivated  by  this,  we  propose  here  a  model  of  motion  in 
depth  computation  that  permits  the  presence  of  general  motion 
of  both  the  camera  and  multiple  objects  in  the  environment. 

Approach 

Motion in depth (MID) parameters refer to the following
three components of the object or camera motion: the single
component of translation in depth (along the line of sight) and
the two components of rotation in depth (rotations that are not
about the line of sight).

The model uses stereo time-varying optic flow fields to
perform a fast and robust computation of the parameters that
specify the absolute MID values. It is based on the concept of
stereoscopic motion, described by Beverly and Regan [8]. Stereoscopic
motion refers to the relative motion present between the stereo
pair of a temporal image sequence. Psychophysical evidence [8]
indicates that stereoscopic motion is used by human observers
to extract the translation in depth of objects in the scene.

In the proposed model, therefore, we consider the dynamics
of a scene as viewed by a stereo pair of cameras. This is
distinct from the analysis of the static imagery of the stereo pair at
one time instant to obtain the static disparity between the two
images. The model assumes for its input the existence of the
static disparity field corresponding to the first stereo pair of the
two-frame stereo sequence, along with the optic flow computed
separately for each of the two stereo image sequences. These are
used to determine the relative image change due to motion in the
temporal binocular image sequence. Issues dealing with the
correspondence problem in stereo and motion analysis are bypassed
for the purposes of the current work. Hence, the model proposes
an integrated interpretation of the disparity and optic flow
information as opposed to an integrated analysis of stereo and motion
that deals with correspondence issues. The latter approach has
been adopted in other work in this area [13,30]. These and other
related works are discussed in the following section.

It was shown by Waxman and Duncan [30] that the use of
temporal binocular imagery permits the formulation of a linear
relationship between the relative flow and disparity values and the
MID parameters. This was used to establish correspondence
between stereo pairs of images. We believe this to be insufficient use
of the information that is available from the linear relationship
and hence propose to use this formulation as the theoretical
basis for our approach to computing the MID parameters of object
or sensor motion for a sensor travelling through an environment
that may have independently moving objects.

A two-stage global technique is employed here to compute
the three motion in depth parameters. In the first stage, we
employ the segmentation technique developed by Adiv [1,2] as
the initial stage of his algorithm for interpreting monocular optic
flow fields. We perform this segmentation in order to obtain a
2-D grouping of the optic flow field. In the second stage,
segments from the first stage are grouped into regions that
correspond to the same set of three MID parameters by employing a
least squares minimization technique on the relative optic flow
between the stereo pair of image sequences.
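
To make the least squares step concrete, here is a minimal sketch under stated assumptions: each sample is a triple `(x, y, r)`, where `r` is a scalar measurement at image point `(x, y)` (for instance, the ratio of relative flow to disparity mentioned in the abstract), modelled as the linear function `a*x + b*y + c`. The function names and sample layout are hypothetical, the solver is a plain normal-equations fit, and no claim is made here about which coefficient corresponds to which MID parameter or about how the authors implemented the minimization.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_linear(samples):
    """Least squares fit of r = a*x + b*y + c; returns (a, b, c)."""
    sxx = sxy = syy = sx = sy = n = 0.0
    sxr = syr = sr = 0.0
    for x, y, r in samples:
        sxx += x * x
        sxy += x * y
        syy += y * y
        sx += x
        sy += y
        n += 1.0
        sxr += x * r
        syr += y * r
        sr += r
    # Normal equations A * (a, b, c)^T = rhs.
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxr, syr, sr]
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("degenerate sample configuration")
    coeffs = []
    for k in range(3):  # Cramer's rule, one column at a time
        Ak = [row[:] for row in A]
        for i in range(3):
            Ak[i][k] = rhs[i]
        coeffs.append(det3(Ak) / d)
    return tuple(coeffs)


if __name__ == "__main__":
    grid = [(0.1 * u, 0.1 * v) for u in range(-3, 4) for v in range(-3, 4)]
    data = [(x, y, 0.2 * x - 0.1 * y + 0.05) for x, y in grid]
    print(fit_linear(data))
```

In a segment-wise use, `samples` would be restricted to the pixels of one group from the first stage, so that the grouping acts as a mask on the data being fitted.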

Note that such a two-step view of 3-D motion interpretation
essentially follows the viewpoint of Adiv [1]. Performing
segmentation on the 2-D image plane provides a method of restricting
the actual 3-D interpretation mechanism (i.e., the global
minimization step in the current model) to a semantically relevant
set of flow vectors and thus contributes to the robustness of the
entire algorithm. A direct attempt at interpreting the 3-D
motion from the 2-D flow without an intermediate grouping would
necessarily be a local analysis of the flow and hence be very
susceptible to noise. On the other hand, it is important to note
that this segmentation step uses an approximation to determine
the grouping. Hence, it is important not to use this step to
directly compute the values of the 3-D motion parameters, but to use
this grouping as a mask on the relative flow field in order to
determine the values of the MID parameters without making any
surface approximations.

MID parameters can be obtained from a general 3-D motion
computation algorithm using monocular imagery, but these
algorithms are computationally intensive and do not permit quick,
reliable computation. One of the chief reasons for this is that the
flow information is related to the depth and motion parameters
via nonlinear equations. The main advantage provided by the proposed
integrated interpretation of stereo and motion information for
extracting motion in depth is that stereo information provides
additional constraints that make it possible to formulate a linear
relationship between the data in the flow and disparity fields and
the motion in depth parameters. This, in turn, makes it possible
to devise a direct computation without hypothesizing any of the
motion parameters, as is done in [1,20]. This can conceivably be
used to directly compute the remaining three motion parameters,
again without employing a hypothesize-and-test scheme.
Hence, extracting all the motion parameters would become a
problem of handling two sets of linear functionals, and we plan
to demonstrate this in the future.

In Section 2, we review existing techniques for dealing with
the problem of integrating stereo and motion information. The
assumptions and limitations of these techniques are discussed.
We also briefly review current motion interpretation research.
In Section 3, we formulate the theoretical model. In Section 4,
we describe the algorithm and discuss the decisions that were
made in the development and implementation of the algorithm.
Experiments based on simulated data are described in Section
6. These experiments demonstrate the generality of the
algorithm. In future work, we plan to demonstrate the algorithm
on real data. Anandan's algorithm [4,5], which has shown
state-of-the-art performance on real imagery, will be used to extract
the optic flow fields. A modified version of the same algorithm
can be used to extract the disparity values as well. In Section
7, we summarize the approach and major results, and discuss
directions for future research.

2.  Literature  Survey 

In this section we examine the approaches taken by several
authors to the problem of interpreting temporal binocular
imagery. The research over the years has chiefly been on the
separate interpretation of optic flow in monocular imagery or on static
stereo analysis. An extensive review of work on the
interpretation of monocular optic flow is given in [7], and more recent work
includes that of [3,31].

Any method for the interpretation of monocular optic flow
fields is limited by the fact that the flow value at a point in the
image is a nonlinear function of the six parameters of motion
and the depth of the corresponding environmental point.
Dealing with this imposes restrictions on any interpretation
technique. For instance, iterative methods such as [10,20,26] need
good initial guesses. Sensitivity to noise is reported by much
of the earlier research, e.g., [10,20,25,27]. Some techniques, such
as [15] and [11], either assume restricted motions such as pure
translation or assume that one component of the motion, such
as the rotation, may be known and can be subtracted out prior
to the computation. Others deal with a stationary environment
and moving camera, disallowing the possibility of multiple
independently moving objects [24].

Some of the more successful general methods that deal with
multiple independently moving objects, as well as the presence
of general motion, can be found in [1,2] and [31]. Good results
in determining 3-D motion and object structure for simulated
data have been reported in [31], although computation of the
local derivatives of the flow may be highly sensitive to noise.

The work of Adiv [1,2] appears to be robust. However, the
technique adopted for the computation of the motion parameters
is computationally intensive and does not permit quick
evaluation of at least some of the motion parameters that would be
important in a practical context. This is a defect that is
inherent to any method that deals with monocular imagery, due
to the underconstrained nature of the relationship between the
flow field and the information desired from it. In the case of
the method adopted in [1,2], for instance, the motion
parameters are computed only at the end of a search of a sampling
of possible translation directions and corresponding optimal
rotation parameters, an approach that is essentially a bottom-up
hypothesize-and-test scheme in practice.

Any method that is directed toward quickly extracting
motion parameters, while retaining the ability to deal with multiple
objects and general motion, has to deal with the fact that
monocular imagery with just two frames provides underconstrained
information. There is a need to look for additional sources of
information that will provide more constraints for computing the
motion parameters. This leads to the use of

•  multiple frames of monocular imagery, and

•  stereo time-varying imagery.



We now briefly review the previous work that addresses the
latter aspect of integrating stereo and motion analysis. Techniques
have been formulated to facilitate the solution to the
correspondence problem for both stereo and motion as well as to address
interpretation issues.

The first use of the term stereoscopic motion was in the work
of Regan et al. [21]. They addressed the problem of using
visual information to recover motion in depth of objects, and
provided evidence to support models of neural organizations in
the human visual system for detecting the motion in depth of
objects. They propose the presence of neural "filters" that are
sensitive to the relative velocities in the left and right retinal
images and are thereby selectively sensitive to the direction of
motion in depth of objects in the visual field. These
motion-in-depth detectors are thus viewed as binocularly driven channels
that process the changing disparity. Our computational model
has been motivated by the psychophysical evidence that strongly
supports such a formulation. Also of interest to us is
psychophysical evidence in [21] for the following:

•  changing disparity grows relatively more effective as the
velocity increases, and

•  changing disparity grows relatively more effective as the
inspection time increases.

In  future  work,  we  will  examine  the  relevance  of  these  conclusions 
to  the  proposed  model. 

Other work of Regan [22] indicates that the human visual
system possesses sensitivity to four kinds of relative motion, namely,

•  the velocity difference between two points in one retinal
image along the line joining the two points,

•  the velocity difference between the two points in one retinal
image perpendicular to the line joining them,

•  rotary motion, and

•  the ratios of the velocities of the left and right retinal
images of an object moving in depth.

We note that the latter sensitivity may be used to recover
object motion in depth. An interesting alternate viewpoint
provided here is that these filters may provide physiological means
of analysing the local flow in the retinal images in order to
recover specific information about the environment.

The approach taken by Richards [23] to integrate stereo
and motion information uses both to extract three-dimensional
information about the environment in a manner that results in
a unique solution for the object structure. The problem with
solving for structure from pure motion information is that any
algorithm needs to deal with second order equations relating
the flow to the structure. This will result in the possibility of
multiple interpretations of object configurations. It is required
that these be removed by disambiguating the possible solutions.
The problem noted with stereopsis is that the correct
configuration of objects can be determined only if the fixation distances
are known. This is because the same configuration at different
distances will result in different angular disparities. Motion
information can correctly interpret the angular relations between
objects, thus making knowledge of the fixation distance
unnecessary. The approach uses stereopsis to provide information which
disambiguates the multiple interpretations found with pure
motion, since stereo can provide absolute depth information. We
note that this work deals purely with the problem of extracting
the structure of the environment and does not deal with the use
of stereo and motion information to extract the motion
parameters of objects in the environment or of the camera.

The approach of Jenkin [13] uses stereo and motion
information in a prediction-correction mechanism to facilitate the
solution to the correspondence problem both in stereo and
motion analysis. The algorithm predicts the position that a point in
the current frame might correspond to, using the previous
motion of that point, i.e., by using the previous frames. Hence, the
stereopsis correspondence problem is simplified since the search
area is restricted to that predicted by the analyses of the
previous frame pairs. We note that the algorithm addresses only the
correspondence problem for both stereo and motion, and not the
interpretation of stereo or motion information.

A unified approach to the analysis of stereo and motion data
is given by Waxman and Duncan [30]. They show that there
is a correlation between the stereo disparity and the relative
flow between the stereo pair of an image sequence. This result
is used to establish correspondence in the context of local
support from neighbourhoods. It is proposed that in the analysis of
time-varying stereo imagery, after the initial correspondence is
established, matching for subsequent images need be performed
only at the peripheral regions of the image as well as around
occluding boundaries. We note that this work approaches the
integration of stereo and motion information from the viewpoint
of facilitating the correspondence problem.

An entirely different approach using two frames of stereo
imagery is taken by Huang and Blostein [12]. They estimate the
rigid body rotation and translation parameters by matching 3-D
points determined at two time instances from stereo information.
This 3-D matching problem can be solved by considering
geometric relationships that should be preserved under rigid body
displacements. The 2-D matching problem continues to persist
for the stereo matching, while the motion estimation algorithm
needs to deal with appropriate mechanisms for 3-D matching.

Mutch [18] describes a technique to recover the translation
in depth of a moving object point, using the concept of
stereoscopic motion as described by [8]. The change in the location of
the image of an object point is defined by its "change vector."
The relative difference between two change vectors in the left
and right image sequences corresponding to a translating object
point is such that their relative orientation and magnitude leads
to the perception of certain 3-D properties of the object,
including its translation in depth. The direction of the translation in
depth as well as the position of the impact point can be
determined from this method. We note the approach requires that
in the case of general motion by the object point, the rotational
component first be removed from the change vector. This work
is of interest since it uses stereoscopic motion in a computational
context to determine the translation in depth of objects.

We give a more general approach to the problem of
interpreting stereo and motion information by being able to deal with
general motion and the presence of multiple independently
moving objects. Also, we do not restrict the point of interest to a
single object point, such as the centroid in [18], but develop the
technique for dense flow and disparity fields. This is more robust
since more global information is allowed in the computation.
Finally, we extract translation as well as rotations in depth of the
camera and the objects.
3.  Theoretical  Formulation  Of  The  Model

In this section we develop the mathematical framework for
our approach. We first consider monocular flow analysis, and
then generalize to binocular flow.

Monocular  Flow  Analysis 

Given the optic flow on the image plane, we can relate the
values of the components of the flow field at every point in the
image to the 3-D motion parameters and depth of the
environmental point that projects to this image point.

Let us consider a Cartesian coordinate system $(X, Y, Z)$ that
is fixed with respect to the camera, with the focal length
normalized to a distance of 1 from the origin $O$ to the image plane. Let
$(x, y)$ represent the image coordinate system. The perspective
projection of an environmental point $(X, Y, Z)$ on the image
plane is then given by

$$ x = X / Z, \qquad (1) $$

$$ y = Y / Z. \qquad (2) $$
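
Equations (1) and (2) amount to a single division per coordinate. A minimal sketch follows; the function name `project` and the choice to reject points with non-positive depth (behind or on the camera plane) are assumptions for illustration.

```python
def project(X, Y, Z):
    """Perspective projection with unit focal length, equations (1) and (2)."""
    if Z <= 0:
        # Assumption: only points in front of the camera are projected.
        raise ValueError("point must have positive depth Z")
    return X / Z, Y / Z


if __name__ == "__main__":
    print(project(2.0, 4.0, 8.0))
```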

We now consider the motion relative to the camera of a rigid
object in the environment. Let $P$ be the position vector of some
point on the object, with camera coordinates given by $(X, Y, Z)$
(see Figure 1). Since the object is rigid, the instantaneous
velocity of $P$ can be represented as a combination of an
infinitesimal rotation $\Omega$ and an infinitesimal translation $T$. Thus,

$$ \dot{P} = \Omega \times P + T. \qquad (3) $$

In components, this becomes:

$$ \begin{pmatrix} \dot{X} \\ \dot{Y} \\ \dot{Z} \end{pmatrix} = \begin{pmatrix} \Omega_Y Z - \Omega_Z Y + T_X \\ \Omega_Z X - \Omega_X Z + T_Y \\ \Omega_X Y - \Omega_Y X + T_Z \end{pmatrix} $$
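
The component form of equation (3) is just the cross product written out. The following sketch computes the velocity of a point from $\Omega$, $T$ and $P$ given as 3-tuples; the function names are hypothetical.

```python
def cross(a, b):
    """Cross product a x b of two 3-tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def point_velocity(omega, T, P):
    """Instantaneous velocity of P under equation (3): omega x P + T."""
    w = cross(omega, P)
    return (w[0] + T[0], w[1] + T[1], w[2] + T[2])


if __name__ == "__main__":
    # Rotation about the Z axis plus a translation.
    print(point_velocity((0.0, 0.0, 1.0), (0.5, 0.0, -1.0), (1.0, 0.0, 5.0)))
```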

Figure 1: Monocular Camera Configuration

The corresponding image point $(x, y)$ has an image velocity given
by $(\alpha, \beta)$, where $\alpha = \dot{x}$ and $\beta = \dot{y}$, which becomes, from (1), (2) and (3):

$$ \alpha = -\Omega_X x y + \Omega_Y (1 + x^2) - \Omega_Z y + (T_X - T_Z x)/Z, \qquad (4) $$

$$ \beta = -\Omega_X (1 + y^2) + \Omega_Y x y + \Omega_Z x + (T_Y - T_Z y)/Z. \qquad (5) $$

We therefore see that the flow equations (4, 5) are second-order
functions of $(x, y)$.
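
As a sanity check on the flow equations, the sketch below evaluates $(\alpha, \beta)$ in closed form and compares it with a finite-difference estimate obtained by moving the 3-D point for a short time step and re-projecting it. The closed form uses the sign conventions that follow from differentiating $x = X/Z$, $y = Y/Z$ with the rigid-motion velocity $\Omega \times P + T$; the function names and the step size `dt` are assumptions for illustration.

```python
def flow(x, y, Z, omega, T):
    """Image velocity (alpha, beta) at image point (x, y) with depth Z."""
    ox, oy, oz = omega
    tx, ty, tz = T
    alpha = -ox * x * y + oy * (1 + x * x) - oz * y + (tx - tz * x) / Z
    beta = -ox * (1 + y * y) + oy * x * y + oz * x + (ty - tz * y) / Z
    return alpha, beta


def flow_finite_difference(P, omega, T, dt=1e-6):
    """Numerical image velocity: move P by its 3-D velocity, re-project."""
    X, Y, Z = P
    ox, oy, oz = omega
    tx, ty, tz = T
    vx = oy * Z - oz * Y + tx  # components of omega x P + T
    vy = oz * X - ox * Z + ty
    vz = ox * Y - oy * X + tz
    X1, Y1, Z1 = X + vx * dt, Y + vy * dt, Z + vz * dt
    return (X1 / Z1 - X / Z) / dt, (Y1 / Z1 - Y / Z) / dt


if __name__ == "__main__":
    P = (1.0, -2.0, 8.0)
    omega = (0.01, -0.02, 0.03)
    T = (0.5, -0.3, 1.0)
    print(flow(P[0] / P[2], P[1] / P[2], P[2], omega, T))
    print(flow_finite_difference(P, omega, T))
```

The two printed pairs should agree to within the truncation error of the finite difference.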

Binocular  Flow  Analysis

We now extend the above derivation to the case of a stereo
pair of cameras.

The two image planes are coplanar and perpendicular to the
ground plane. We assume that the focal points lie along the
same horizontal line (see Figure 2). The analysis follows that of
[30].

Let us denote the Cartesian coordinate system for the left
camera by $(X_l, Y_l, Z_l)$ and that for the right camera by $(X_r, Y_r, Z_r)$.
Let the horizontal displacement between the two focal points, $O_l$
and $O_r$ (see Figure 2), be $b$. We denote entities with respect to
the left and right cameras by the subscripts $l$ and $r$ respectively.
The final formulations are with respect to the left coordinate
system.

Consider  a  point  at  (A't.fj.Zi).  The  projection  of  this  point 
on  the  left  image  plane  is  given  by  (xj.yt).  Similarly,  the  point 


Figure  2:  Binocular  Camera  Configuration 


is  at  a  position  ( Xr ,  Yr .  Zr )  with  respect  to  the  right  coordinate 
system,  and  it  projects  on  the  right  image  plane  at  (rr,yr).  Let 
the  disparity  between  the  corresponding  image  points  be  given 
by  <5,  where 

S  xj  -  xr  =  6/Z,  (6) 

since 

Zt  =  zr  =  Z  (7) 

Also  note  that 

yi  =  yr  =  y-  (8) 

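Let us now consider the flow equations for the two cameras.
From equations (4,5), we have with respect to the left camera

$$ \alpha_l(x_l, y_l) = -\Omega_{X_l} x_l y_l + \Omega_{Y_l}(1 + x_l^2) - \Omega_{Z_l} y_l + (T_{X_l} - T_{Z_l} x_l)/Z, \tag{9} $$

$$ \beta_l(x_l, y_l) = -\Omega_{X_l}(1 + y_l^2) + \Omega_{Y_l} x_l y_l + \Omega_{Z_l} x_l + (T_{Y_l} - T_{Z_l} y_l)/Z, \tag{10} $$

and with respect to the right camera,

$$ \alpha_r(x_r, y_r) = -\Omega_{X_r} x_r y_r + \Omega_{Y_r}(1 + x_r^2) - \Omega_{Z_r} y_r + (T_{X_r} - T_{Z_r} x_r)/Z, \tag{11} $$

$$ \beta_r(x_r, y_r) = -\Omega_{X_r}(1 + y_r^2) + \Omega_{Y_r} x_r y_r + \Omega_{Z_r} x_r + (T_{Y_r} - T_{Z_r} y_r)/Z. \tag{12} $$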

But from the geometry we know that

$$ \Omega_l = \Omega_r = \Omega \tag{13} $$

and

$$ T_r = T_l - \Omega_l \times b\hat{x}, \tag{14} $$

where $\hat{x}$ is the unit vector pointing from $O_l$ to $O_r$.
Hence we can rewrite the flow equations (11,12) for the right image
plane using equations (13,14). We then have with respect to the
left coordinate system,

$$ \alpha_r(x_l + \delta, y_l) = -\Omega_{X_l}(x_l y_l + y_l \delta) + \Omega_{Y_l}(1 + x_l \delta + x_l^2) - \Omega_{Z_l} y_l + (T_{X_l} - T_{Z_l}\delta - T_{Z_l} x_l)/Z, \tag{15} $$

$$ \beta_r(x_l + \delta, y_l) = -\Omega_{X_l}(1 + y_l^2) + \Omega_{Y_l} x_l y_l + \Omega_{Z_l} x_l + (T_{Y_l} - T_{Z_l} y_l)/Z. \tag{16} $$

The two components of the relative flow between the two images
are thus given by

$$ \Delta\alpha = \alpha_r(x_l + \delta, y_l) - \alpha_l(x_l, y_l) = [\Omega_{Y_l} x_l - \Omega_{X_l} y_l - T_{Z_l}/Z]\,\delta, \tag{17} $$

$$ \Delta\beta = \beta_r(x_l + \delta, y_l) - \beta_l(x_l, y_l) = 0. \tag{18} $$

We can now write the ratio of the relative flow between the two
image points to the disparity, expressed with respect to the left
camera coordinate system, as:

$$ \frac{\Delta\alpha}{\delta} = \Omega_{Y_l} x_l - \Omega_{X_l} y_l - \frac{T_{Z_l}}{Z}, \tag{19} $$

$$ \frac{\Delta\beta}{\delta} = 0. \tag{20} $$

We refer to the vector $(\Delta\alpha/\delta, \Delta\beta/\delta)$
as the relative flow vector, and a field of these vectors as the
relative flow field. Since only the "left" quantities appear in the
remainder of this work, we drop the "$l$" subscript to denote this.
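Equations (13) to (20) can be checked numerically. The sketch below follows the convention of equations (14) to (17): the right-image point is evaluated at $x_l + \delta$ and the right-camera translation is $T_l - \Omega \times b\hat{x}$. The function names and sample values are assumptions made for illustration.

```python
def image_flow(x, y, Z, T, Omega):
    """Flow equations (4) and (5)."""
    Tx, Ty, Tz = T
    Ox, Oy, Oz = Omega
    alpha = -Ox * x * y + Oy * (1 + x * x) - Oz * y + (Tx - Tz * x) / Z
    beta = -Ox * (1 + y * y) + Oy * x * y + Oz * x + (Ty - Tz * y) / Z
    return alpha, beta


def cross(a, b):
    """Vector cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def relative_flow(x, y, Z, T, Omega, baseline):
    """Relative flow vector (delta_alpha / delta, delta_beta / delta)."""
    delta = baseline / Z                              # equation (6)
    shift = cross(Omega, (baseline, 0.0, 0.0))        # Omega x (b * x_hat)
    T_right = tuple(t - s for t, s in zip(T, shift))  # equation (14)
    a_l, b_l = image_flow(x, y, Z, T, Omega)
    a_r, b_r = image_flow(x + delta, y, Z, T_right, Omega)
    return (a_r - a_l) / delta, (b_r - b_l) / delta
```

For any motion, the first component should reduce to $\Omega_Y x - \Omega_X y - T_Z/Z$ and the second to zero, so the terms in $T_X$, $T_Y$ and $\Omega_Z$ drop out of the result.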

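An interesting point derived in [30] is that the x-component
of the relative flow vector, given by equation (19), is equivalent
to the ratio of the rate of change in the disparity to the
disparity. We now derive this relationship.

We know that the disparity, $\delta$, is given in equation (6) as

$$ \delta = \frac{b}{Z}. \tag{21} $$

Hence, the rate of change of disparity, $\dot{\delta}$, is given as

$$ \dot{\delta} = -\frac{b}{Z^2}\dot{Z} = -\frac{\dot{Z}}{Z}\,\delta. \tag{22} $$

Referring to equation (3), we have

$$ \frac{\dot{Z}}{Z} = \frac{1}{Z}(-\Omega_Y X + \Omega_X Y + T_Z) = -\Omega_Y x + \Omega_X y + T_Z/Z. \tag{23} $$

Substituting this into equation (22), we find

$$ \dot{\delta} = [\Omega_Y x - \Omega_X y - T_Z/Z]\,\delta \tag{24} $$

or

$$ \frac{\dot{\delta}}{\delta} = \Omega_Y x - \Omega_X y - T_Z/Z. \qquad \text{Q.E.D.} \tag{25} $$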

Hence,  we  can  obtain  the  relative  flow  field  data  in  one  of 
two  ways,  i.e., 

•  take  the  relative  flow  between  the  two  optic  flow  fields  cor¬ 
responding  to  the  left  and  right  cameras,  or 

•  approximate  the  differential  of  the  disparity  field  over  the 
interval  between  the  two  time  instances. 
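The second route listed above, differencing the disparity over time, can be sketched as follows. The code moves a point for a short interval using the Z component of equation (3), recomputes the disparity from equation (6), and compares the finite-difference ratio with the right-hand side of equation (25). The step size `dt` and all names are illustrative assumptions.

```python
def depth_rate(P, T, Omega):
    """Z component of equation (3): Omega_X * Y - Omega_Y * X + T_Z."""
    X, Y, _ = P
    return Omega[0] * Y - Omega[1] * X + T[2]


def disparity_rate_ratio(P, T, Omega, baseline, dt=1e-6):
    """Finite-difference estimate of (d delta / dt) / delta."""
    Z0 = P[2]
    Z1 = Z0 + depth_rate(P, T, Omega) * dt  # depth after a short time step
    d0 = baseline / Z0                      # disparity before the step
    d1 = baseline / Z1                      # disparity after the step
    return (d1 - d0) / (dt * d0)


def predicted_ratio(P, T, Omega):
    """Right-hand side of equation (25)."""
    X, Y, Z = P
    return Omega[1] * (X / Z) - Omega[0] * (Y / Z) - T[2] / Z
```

With a small step the two values should agree closely. In practice the interval is one frame, so this route is only an approximation of the differential.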

We choose to obtain the relative flow fields using the former
method. This view is supported in a psychophysical context
since experimental studies [22] have shown some subjects to have
areas of the visual field that are blind to static disparity and yet
possess normal sensitivity to motion in depth. Hence, we believe
that the extraction of motion in depth by computing the relative
optic flow between a stereo pair is a closer model of the human
perception of such motion.

4.  Computation  of  the  MID  parameters 
using  Stereoscopic  Motion  Constraints. 

A  Brief  Introduction  to  the  Approach 

We employ a global technique to extract the three MID
parameters $\{\Omega_X, \Omega_Y, T_Z\}$ in the presence of general
motion of the camera and several independently moving objects.

The theoretical basis for this approach to the integration
of the stereo and motion information was given in Section 3.
Briefly, if $u$ is the relative optic flow at image points in the
left and right cameras corresponding to a single world point and
$\delta$ the disparity between the left and right images at any one
time instant for the same point, it was shown that the ratio of $u$
to $\delta$ is a linear function of the image coordinates. The
coefficients of this linear functional are the rotations about the
$X$ and $Y$ axes and the translation along the $Z$ axis, which are
precisely the MID parameters being computed in our approach.

Algorithm  Description 

The inputs to the algorithm are the two flow fields, one each
for the left and the right images, and the disparity field between
the left and the right images at any one time instant.

The algorithm proceeds in the following three steps:

•  Extract  the  relative  flow  field  between  the  left  and  right 
images  using  the  difference  of  the  two  optic  flow  fields  along 
with  the  disparity  field. 

• Employ Adiv’s technique [1,2] for obtaining the segmentation
of the monocular optic flow field corresponding to
the left camera. Thus, we perform a segmentation of the
optic flow field from pure motion information on the 2-D
image plane in order to obtain a grouping of the vectors,
where each segment corresponds to the motion of a roughly
planar surface. We discuss the effect of performing a similar
grouping on the relative flow fields in Section 4.1.

•  Merge  the  segments  on  the  2D  image  plane  (obtained  from 
the  previous  segmentation  step)  based  on  a  least  squares 
minimization  to  compute  the  MID  parameters  for  each  of 
the  merged  regions.  The  output,  at  this  stage  is  a  grouping 
of  the  image  into  regions  that  correspond  to  the  same  set 
(within  some  normalized  value  of  the  deviation)  of  three 
MID  parameters,  i.e.,  the  two  parameters  of  rotation  in 
depth  and  the  scaled  translation  in  depth. 
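The first of the three steps above can be sketched for a single image row. The sketch assumes that the horizontal flow components and the disparity are sampled per pixel in the same units, that the corresponding right-image point lies at `column + disparity` (the convention of equation (17)), and that linear interpolation is acceptable for fractional disparities. None of these implementation details are stated in the text, and all names are hypothetical.

```python
def sample(row, pos):
    """Linearly interpolate `row` at fractional index `pos` (None if outside)."""
    if pos < 0 or pos > len(row) - 1:
        return None
    i = int(pos)
    if i == len(row) - 1:
        return row[i]
    f = pos - i
    return (1 - f) * row[i] + f * row[i + 1]


def relative_flow_row(alpha_left, alpha_right, disparity, min_disp=1e-6):
    """Per-pixel (alpha_r - alpha_l) / delta along one image row.

    Entries are None where the disparity is negligible or the
    corresponding right-image point falls outside the row.
    """
    out = []
    for i, (a_l, d) in enumerate(zip(alpha_left, disparity)):
        a_r = sample(alpha_right, i + d) if abs(d) > min_disp else None
        out.append(None if a_r is None else (a_r - a_l) / d)
    return out
```

For example, `relative_flow_row([0, 1, 2, 3], [10, 12, 14, 16], [1, 1, 0.5, 2])` pairs each left pixel with its displaced right-image sample and leaves the last entry undefined because the correspondence falls off the row.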

In the interest of robustness, the grouping obtained in the
segmentation step is used as a template to guide the areas in
the image where the minimization is employed. The actual
optimization constraint is applied to the information in the
relative flow field within each single segment, or possibly merged
group of segments, in order to extract the MID parameters for them.

Also, since we deal only with MID parameters, the merged
groups of regions cannot be interpreted as representing objects in
the scene. They represent those regions in the image that have
the same set (within some normalized value of the deviation)
of MID parameters. See Expt. 4 (Figure 5c) in Section 5 for
an example of a grouping wherein the background plane and a
stationary ellipsoid in the environment get merged as one region
simply because both have the same MID parameters relative to
the camera.

We describe these two steps of the algorithm in the following
two subsections.


4.1  Segmentation 

At this stage of the algorithm, the optic flow field for the left
camera is used to obtain a 2-D grouping of those flow vectors
that are consistent with the motion of an approximately planar
patch [1,2]. We also discuss the effects of performing a similar
segmentation on the relative flow field.

Formulation  Of  The  Segmentation  Constraint 

Let us first examine the use of optic flow for segmentation.
We briefly review the segmentation process as developed in [1,2].
The reader is advised to see these for more details. Consider
the flow field that is induced by the motion of a rigid planar
surface described by

$$ k_X X + k_Y Y + k_Z Z = 1. \tag{26} $$

Then we can rewrite equations (4,5) as

$$ \alpha = a_1 + a_2 x + a_3 y + a_7 x^2 + a_8 xy, \tag{27} $$

$$ \beta = a_4 + a_5 x + a_6 y + a_7 xy + a_8 y^2, \tag{28} $$

where $\{a_1, \ldots, a_8\}$ are functions of the 3-D motion
parameters, $(T, \Omega)$, of the objects in the environment (or
conversely, the camera) and the three planar surface parameters,
$(k_X, k_Y, k_Z)$:

$$ \begin{aligned}
a_1 &= \Omega_Y + k_Z T_X, & a_2 &= k_X T_X - k_Z T_Z, \\
a_3 &= -\Omega_Z + k_Y T_X, & a_4 &= -\Omega_X + k_Z T_Y, \\
a_5 &= \Omega_Z + k_X T_Y, & a_6 &= k_Y T_Y - k_Z T_Z, \\
a_7 &= \Omega_Y - k_X T_Z, & a_8 &= -\Omega_X - k_Y T_Z.
\end{aligned} \tag{29} $$

The desired output is an organization of the optic flow field
into segments, where each segment corresponds to the motion
of a roughly planar surface. The constraint used to perform
this grouping is that each segment thus formed be consistent
with a single $\Psi$ transformation (equations (27,28)). The
$\Psi$ transformation space corresponds to the coefficients of the
second order flow equations as functions of image coordinates. This
space is 8-dimensional, and searching it for the $\Psi$
transformation that is consistent with the motion in the flow field
would be very expensive computationally. Hence, the consistency
constraint is approximately implemented by a two-step process:

• As a preprocessing step, the flow vectors are first grouped
based on consistency with the 6-parameter linear approximation
to equations (27,28) (an affine transformation) using
a modified Hough transform. This yields components.

• These components are then merged together based on the
minimization of an error measure derived using the full
optimal $\Psi$ transformation.

We note several interesting features of this stage. The six
parameter affine transformation corresponds to a $\Psi$
transformation with $a_7 = a_8 = 0$. In addition, the flow component
$\alpha$ ($\beta$) depends only on the parameters
$\{a_1, a_2, a_3\}$ ($\{a_4, a_5, a_6\}$), i.e., the two components
are decoupled. Because of this, the significant computational cost
of a Hough transform on the 6-dimensional affine space can be
mitigated by instead performing a separate Hough transform on each
of the two three-parameter affine spaces $\{a_1, a_2, a_3\}$ and
$\{a_4, a_5, a_6\}$. In addition, the transform is implemented on
the 3-parameter affine spaces in a multi-resolution scheme that
makes use of the concept of dynamically quantized spaces, wherein
the transform is iteratively employed around the value estimated in
the previous iteration using a finer resolution.

It is important to note that this segmentation step uses an
approximation to determine the grouping. Hence, the values of
the $\Psi$ transformation themselves are not used to compute the
motion parameters. This grouping is used as a mask on the
relative flow field in order to determine the values of the MID
parameters without making any surface approximations, in order
to increase the robustness of the algorithm.

Difficulties with direct use of the relative flow field
for segmentation


We now discuss our reasons for using the pure motion flow
fields, rather than the relative flow field, for segmentation
purposes. Using the planar patch approximation (26) in the relative
flow equations (19,20), we find

$$ \frac{\Delta\alpha}{\delta} = b_0 + b_1 x + b_2 y, \tag{30} $$

$$ \frac{\Delta\beta}{\delta} = 0, \tag{31} $$

where

$$ b_0 = -T_Z k_Z, \qquad b_1 = \Omega_Y - k_X T_Z, \qquad b_2 = -\Omega_X - k_Y T_Z. \tag{32} $$

Upon comparing equations (29) and (32) we see that the second
order components of the optic flow field, i.e., the coefficients
$a_7$ and $a_8$, are precisely the first order components of the
relative flow field, namely, $b_1$ and $b_2$ respectively:

$$ a_7 = b_1, \qquad a_8 = b_2. \tag{33} $$

Grouping using the relative flow constraint would thus result in
a set of relative flow vectors that are consistent with the same set
of three parameters, $b_0$, $b_1$ and $b_2$, which amounts to
dealing only with the second order components of the optic flow
field, and disregarding the first order components. This creates
incorrect grouping since in situations where the motion is such that
the second order components of the flow field are very small, the
relative flow field is going to be a function of very small first
order coefficients. This occurs in situations where the translation
and rotations in depth are small. A Hough transform on such
a space will create false peaks in the parameter space and thus
be unreliable. Thus for purposes of grouping, we would like
to utilise all the available information and obtain as reliable a
grouping as possible.

Hence, we use the optic flow (corresponding to the left
camera) in order to obtain the segmentation mask to be used in the
optimization on the relative flow field. In the current
implementation the optic flow corresponding to the left camera is
chosen to obtain the segmentation.
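The decoupled, coarse-to-fine voting described in the preprocessing step of Section 4.1 can be illustrated on one of the two three-parameter affine spaces. The sketch below is a simplified stand-in, not Adiv's modified Hough transform: it scores each cell of a small parameter grid by the number of flow samples that agree with it, breaks ties by squared residual (an assumption added here), and then re-quantizes a narrower range around the winner. All names and default values are hypothetical.

```python
from itertools import product


def hough_affine3(samples, half_width=1.0, steps=5, levels=6):
    """Coarse-to-fine search for (a1, a2, a3) in v = a1 + a2*x + a3*y.

    samples: list of (x, y, v), where v is one flow component.
    """
    center = (0.0, 0.0, 0.0)
    for _ in range(levels):
        cell = 2.0 * half_width / (steps - 1)
        # Candidate values along each parameter axis at this resolution.
        axes = [[c - half_width + k * cell for k in range(steps)]
                for c in center]

        def score(p, tol=cell):
            res = [v - (p[0] + p[1] * x + p[2] * y) for x, y, v in samples]
            votes = sum(1 for r in res if abs(r) <= tol)
            # Most votes wins; ties go to the smaller squared residual.
            return (votes, -sum(r * r for r in res))

        center = max(product(*axes), key=score)
        half_width = cell  # search a finer grid around the current winner
    return center
```

Running the same routine on the $\beta$ samples would give $\{a_4, a_5, a_6\}$. Each level visits only `steps ** 3` cells, which is the practical benefit of splitting the six-dimensional affine space into two three-dimensional ones.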


4.2  Optimization 


Segments  that  are  consistent  with  the  same  set  of  three  MID 
parameters  are  merged  together  in  this  stage,  which  proceeds  in 
several  steps: 

• As an initial step, optimal MID parameters and a related
error measure (see equation (35)) are computed for each of the
segments obtained from the segmentation step. The MID
parameters are computed using a least squares error
minimization (see equation (34)).

•  Sets  of  segments  are  sequentially  tested  for  merger  by  de¬ 
ciding  if  the  relative  flow  field  in  them  corresponds  to  a 
single  set  of  MID  parameters. 

• The merging decisions are based on the degree of consistency
of the relative flow vectors in the entire set of segments
being tested for merging to the same set of three
optimal MID parameters. This is done by comparing the
error measure obtained by taking the entire set of segments
with the error measures for each of the segments taken
singly. Both the error measures are computed with respect
to the three optimal MID parameters that are obtained for
the merged set using a least squares minimization.

• Only segments that have not been included in any previous
merging are included in the next merging.

• The MID parameters are computed for the merged sets of
segments.

•  All  segments  that  correspond  to  the  stationary  environ¬ 
ment  are  grouped  together  as  one  merged  segment. 

• Independently moving objects are picked out, unless the
flow corresponding to the regions in and around the objects
is such as to produce ambiguities during the interpretation
process.

•  The  computation  at  this  stage  is  linear  in  complexity  since 
the  technique  considers  only  the  neighbouring  segments  for 
merging  decisions.  This  is  reasonable  since  we  are  search¬ 
ing  for  groups  in  the  image  that  correspond  to  the  same 
set  of  MID  parameters  rather  than  recognise  object  masks. 

Minimization  Process 

It is required that the MID parameters be extracted in the
simplest possible manner that preserves robustness in the entire
computation. In general, it is possible to obtain them for each
segment from the values of the relative flow field at three points
in that segment. We could then think of ways to merge
neighbouring segments that exhibit the same MID parameters. But
two factors need to be considered. First, the flow fields and the
disparity field are generally prone to corruption by noise. Second,
the current method of obtaining the relative flow field by
taking the difference of two flow fields adds numerical errors to
the data. Given these two sources of data distortion, it is
important to use all possible information in computing the
parameters. A least squares error formulation is well suited for
this purpose since it is more global in nature and robust in
implementation.

Computation  of  the  MID  parameters 

The least squares formulation minimizes the error between
the actual relative flow field value at a point and that predicted
by the MID parameters and the depth of the point. Hence, given
a set of flow vectors, it selects the optimal set of three MID
parameters that are consistent with the minimal deviation in
the relative flow field predicted by them from the actual relative
flow field.

Optimization  Constraint 

Based on equations (19,20), the error function to be minimized
over the set of relative flow vectors corresponding to a
single segment, or a possibly merged set of them, is

$$ E(\Omega_X, \Omega_Y, T_Z) = \sum_i W_i \left[ \left(\frac{\Delta\alpha}{\delta}\right)_i - \Omega_Y x_i + \Omega_X y_i + \frac{T_Z}{Z_i} \right]^2, \tag{34} $$

where $T = (T_X, T_Y, T_Z)$ is the translation vector,
$\Omega = (\Omega_X, \Omega_Y, \Omega_Z)$ is the rotation vector and
$Z_i$ is the depth of the environmental point corresponding to the
image point $(x_i, y_i)$. $W_i$ is the weight (or confidence measure
1.51) associated with the flow vector at the point $(x_i, y_i)$. It
is required to obtain the three MID parameters,
$(\Omega_X, \Omega_Y, T_Z)$, that minimize the above functional.
This is done by differentiating the functional with respect
to each of them, and setting the resulting differential equal to
zero. We can then solve the resulting matrix equation for the
three MID parameters. The computation at this step is fast since
we only need to solve a 3x3 linear system, with some additional
multiplication for the weighting process and summation over the
set of relative flow vectors. For the purposes of the present
formulation, we use the disparity field, $\delta$, to provide the
absolute depth information, $Z_i$. Thus, we obtain the absolute
translation in depth, and the two rotations in depth, $\Omega_X$
and $\Omega_Y$.
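The minimization of equation (34) can be sketched as a weighted linear least squares problem. Writing the prediction as $\Omega_Y x_i - \Omega_X y_i - T_Z/Z_i$ and using $1/Z_i = \delta_i / b$ from equation (6), setting the three partial derivatives to zero gives a 3x3 system of normal equations. The code below is a minimal sketch under these assumptions: disparities are expressed in the same units as the baseline, the solver is plain Gaussian elimination, and the weighted RMS residual is used only as an illustrative error measure. The function names are hypothetical.

```python
import math


def solve3(A, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("degenerate set of flow vectors")
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= f * M[col][k]
    sol = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = M[r][3] - sum(M[r][k] * sol[k] for k in range(r + 1, 3))
        sol[r] = s / M[r][r]
    return sol


def fit_mid(samples, baseline):
    """Least squares (Omega_X, Omega_Y, T_Z) minimizing equation (34).

    samples: list of (x, y, rel_flow, disparity, weight).
    """
    A = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for x, y, r, d, w in samples:
        c = (-y, x, -d / baseline)  # rel_flow ~ c . (Omega_X, Omega_Y, T_Z)
        for i in range(3):
            rhs[i] += w * r * c[i]
            for j in range(3):
                A[i][j] += w * c[i] * c[j]
    return tuple(solve3(A, rhs))


def rms_error(samples, baseline, params):
    """Weighted RMS residual of the relative flow under `params`."""
    ox, oy, tz = params
    num = sum(w * (r - (oy * x - ox * y - tz * d / baseline)) ** 2
              for x, y, r, d, w in samples)
    return math.sqrt(num / sum(w for *_, w in samples))
```

A merging test along the lines described earlier could fit the union of two segments with `fit_mid` and compare `rms_error` on the union with the error on each segment alone. The acceptance threshold is not given in the text and would have to be chosen.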

Given a set of $n$ flow vectors, let the solution to the
minimization constraint be given by $P^m = (\Omega_X, \Omega_Y, T_Z)$.
The normalized value of the deviation in the actual relative flow
field from that predicted by the solution $P^m$ can then be given by


5.  Simulation  Results 


In this section, we use stereoscopic motion to compute the
MID parameters of the objects in the environment and of the
camera. The results are shown for a set of simulations that
were devised to cover the following categories of motion for rigid
objects:

• translation in depth alone,

• rotation in depth alone,

• combined translation and rotation in depth, and

• general, independent object and camera motion.

The input data are the simulated flow fields corresponding
to the left and right cameras as well as the simulated static
disparity field between the two cameras for the first time instance.
Note that the flow fields are dense and could only correspond to
highly textured surfaces. In the first four experiments, we have
shown the algorithm performance for ideal flow and disparity
fields. In the fifth and sixth experiments, we show the algorithm
performance with Gaussian noise added to the flow fields. Note
that the algorithm assumes as input the optic flow and disparity
values, and hence bypasses the correspondence problem.

Values of the translation parameters and the surface
parameters are in focal units, the flow and disparity vectors are in
pixel units, and the rotation parameters are in radians. The image
is 128 × 128. The field of view of each of the cameras is 45°. The
baseline is 0.5 focal units, giving rise to disparity values of about
1 to 6 pixels for the simulated environment. The size and
position of the objects are given in focal units and are with respect
to the left camera coordinate system, as are the motion values.
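The quoted disparity range can be reproduced with a short calculation. The sketch assumes that the 45° field of view is the full angle spanned by the 128-pixel image width, so that one focal unit corresponds to 64 / tan(22.5°) pixels. That interpretation and the function names are assumptions, not statements from the text.

```python
import math


def focal_length_pixels(image_width, fov_degrees):
    """Pixels per focal unit for a full field of view of `fov_degrees`."""
    return (image_width / 2.0) / math.tan(math.radians(fov_degrees) / 2.0)


def disparity_pixels(depth, baseline, image_width=128, fov_degrees=45.0):
    """Disparity b / Z of equation (6), converted to pixels."""
    return focal_length_pixels(image_width, fov_degrees) * baseline / depth
```

Under these assumptions one focal unit is about 154.5 pixels, so a baseline of 0.5 gives roughly 0.8 pixels at a depth of 100 and roughly 5.2 pixels at a depth of 15, in line with the stated range of about 1 to 6 pixels.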


Experiment  1  :  Object  Translating  in  Depth 


This experiment demonstrates the chief advantage of using
stereoscopic constraints in motion computation, i.e., the fast
computation of the motion of objects translating in depth. Such
motion represents the direct motion of the object along a line
parallel to the line of sight.

The simulated input motion and the computed translation
in depth for the moving object are shown in Table 1. The
environment consists of two distinct surfaces:

1. a plane described by the equation $Z = 100$.

2. a sphere of radius = 2, at position (-3, -1, 15), translating
with $T_Z = 1.00$.

The camera system is stationary.

size     position     Object Translation (focal units)
                      Input            Computed
2,2,2    -3,-1,15     T_Z = 1.00       T_Z^comp = 1.08

Table 1: Translation in depth of sphere of radius = 2. Camera
is stationary.

Note that the computed value of the translation in depth is
the absolute (not relative) value since disparity information is
used to obtain the depth (see Table 1).


Experiment  2  :  Object  rotating  in  depth. 


In this experiment, we demonstrate the ability of the model
to find the rotation in depth of objects. This will be used in
future studies to model the perception of the structure of rotating
objects [9,28,29]. The motion of interest is a spinning motion
about axes parallel to the image planes.

The simulated input motion and the computed rotation in
depth for the moving object are shown in Table 2. The
environment consists of two distinct surfaces:

1. a plane described by the equation $Z = 100$.

2. a sphere of radius = 2, at position (-3, -1, 15), rotating
with $\Omega_X = 0.05$ and $\Omega_Y = 0.05$.

The values of the rotation in depth parameters output by
the algorithm are shown in Table 2.

size     position     Object Rotation (radians)
                      Input            Computed
2,2,2    -3,-1,15     Ω_X = 0.05       Ω_X^comp = (illegible)
                      Ω_Y = 0.05       Ω_Y^comp = 0.01

Table 2: Rotation in depth of sphere of radius = 2. Camera is
stationary.


Experiment  3  :  General  motion  in  depth  of  object 


It is shown here that the algorithm makes no assumptions
about the motion of the object(s) and is applicable in the
presence of both rotation and translation in depth. This example is
a simulation of the motion of a spinning ball being thrown
towards the observer. The object motion consists of translational
components along the $X$ and $Z$ axes, and a single rotational
component about the $Y$ axis (both stated with respect to the left
camera coordinate system). The camera system is stationary.
The values of both the translation in depth and the rotations
in depth are computed and are shown in Table 3.

The input being simulated is as follows. The environment
consists of two distinct surfaces:

1. a plane described by the equation $Z = 100$.

2. a sphere of radius = 2, with position = (-3, -1, 15), with
translation ($T_X = 0.5$, $T_Z = 1.20$) and rotation
($\Omega_Y = 0.05$).

size     position     Object Translation (focal units)     Object Rotation (radians)
                      Input          Computed              Input          Computed
2,2,2    -3,-1,15     T_X = 0.50                           Ω_Y = 0.05     Ω_Y^comp = 0.08
                      T_Z = 1.20     T_Z^comp = 1.12

Table 3: Motion in depth of sphere of radius = 2. Camera is
stationary. Note that only the MID parameters are computed.

Note in Table 3 that no computed value of $T_X$ is given. This is
because our algorithm computes only the MID parameters.


Experiment  4  :  General  camera  motion  and  indepen¬ 
dent  object  motion 


In this example, we show the performance on a simulation
of both camera motion and independent object motion. The
camera motion is completely general with all the components of
translation and rotation being present. The object motion has
NO MID component, but has the other three components of
motion, i.e., translations along the $X$ and $Y$ axes, and rotation
about the $Z$ axis.

The input being simulated is as follows. The environment
consists of three distinct surfaces:

• the background described by a plane, $Z = X + 0.5Y + 50$.

• a sphere with position = (9, 9, 30), radius = 2 and motion
described by $(T_X, T_Y, T_Z) = (0.5, -0.5, 0.0)$ and
$(\Omega_X, \Omega_Y, \Omega_Z) = (0.00, 0.00, -0.19)$.

• a stationary ellipsoid with position = (-3, -1, 20) and size
= (2, 5, 2).

The motion of the camera is $(T_X, T_Y, T_Z) = (0.5, 0.5, 1.0)$
and $(\Omega_X, \Omega_Y, \Omega_Z) = (0.02, -0.02, 0.04)$.

The  flow  fields  corresponding  to  the  left  and  right  cameras  as 
well  as  the  disparity  field  between  them  at  the  first  time  instance 
are  shown  in  Figures  4a,  4b  and  4c,  respectively.  Figure  4d 
represents  the  segmentation  mask  obtained  from  the  left  optic 
flow  field  to  be  used  as  a  mask  in  the  optimization  stage. 

Note that in Figure 4e, the minimization step of the
algorithm results in the stationary ellipsoid getting merged with the
background. This is because both have the same MID attributes
relative to the moving camera. The independently moving sphere
is still retained as a separate region since it has MID attributes
relative to the moving camera that are distinct from those of
the stationary background. This is an example of the fact that
the algorithm does not pick out object masks, but does pick out
regions in the image with the same set of MID parameters.

We note that the computed values of the MID parameters for
the independently moving sphere are inaccurate. We attribute
this to the fact that the number of relative flow vectors
(19,20) in the mask corresponding to the segmentation of the
sphere is small. When the values are computed using the least
squares minimization process on this small set of vectors, errors
can be expected.


See  Table  4  for  the  computed  MID  parameters. 


Experiment 5 : Object Translating in Depth - Noisy
Optic Flow Fields


In this experiment, we show the performance of the algorithm
for the computation of the translation in depth of the object with
noisy optic flow fields as input.

The environment simulated is identical to that in Experiment
1, with the camera being stationary. We have added Gaussian
noise of $\sigma = 0.3$ to the optic flow fields for the left and the
right camera. These are shown in Figures 5a and 5b respectively. The
results of the segmentation on the left optic flow field are shown
in Figure 5c. Table 5 shows the results of the computation of
the translation in depth for the object.

Note that the error in the computation is not significantly
greater than that shown in Table 1 for Experiment 1.


Experiment  6  :  General  camera  motion  and  indepen¬ 
dent  object  motion  -  Noisy  Optic  Flow  Fields 


In this example, we show the performance of the algorithm
in the presence of both camera motion and independent object
motion with noisy optic flow fields as input.

The camera motion is completely general with all the
components of translation and rotation being present. The object
motion has NO MID components, but has the other three
components of motion, i.e., translations along the $X$ and $Y$ axes,
and rotation about the $Z$ axis.

Note that the simulated input motion and environment
description is identical to that in Experiment 4. However, Gaussian
noise of $\sigma = 0.3$ is added to the corresponding optic flow fields
for the left and the right cameras. These are shown in Figures
6a and 6b. Figure 6c shows the results of segmenting the left
optic flow field using Adiv’s technique [1]. In Table 6, we show
the MID parameters computed for the camera and the
independently moving sphere.

Note  that  in  Figure  6d,  the  stationary  ellipsoid  is  still  getting 
merged  with  the  background,  while  the  independently  moving 
sphere  continues  to  be  picked  out,  just  as  in  Experiment  4  (see 
Figure  4e). 

The  MID  parameters  computed  for  the  independently  mov¬ 
ing  sphere  are  even  more  inaccurate  than  those  computed  for 
the  ideal  flow  fields  in  Exp  4  This  we  attribute  to  the  fact  that 
the  number  of  relative  flow  vectors  (19.20)  in  the  mask  corre¬ 
sponding  to  the  segmentation  of  the  sphere  i*  small  a*  well  a- 
the  fact  that  the  error  in  the  relative  flow  vectors  inrrra*r-  with 
non-ideal  flow  fields.  We  note  that  the  computed  \alue*  <*f  the 
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X  -  0.50 
Ty  -0.5 


position  Objrrt  Translation  Object  Rotation 

(focal  units)  (radians) 


Input  Computed  Input  Computed 

9.9.30  Tv  -0.50  n.v  =  o.oo  nxw''  -  OO-f 

Ty  -  0  5  nY  =  o.oo  n  ymr  =  o.o2 

_ Tz  -  0.00  Tz'"'v  -  0  1 1 _ Uz  -  -0.19 _ 

position  Object  Translation  Object  Rotation 

(focu!  units)  (radians) 


-.1.-1. 20  stationary 


stationary 


Camera  Translation 
(for a!  unit:) 


Input  Computed 


Tx  ~  0.50 
Ty  0.05 
Tz  1.0 


Camera  Rotation 
(  radians) 


Computed 


Figure 5a: Simulated noisy, dense optic flow field for the left
camera.

Figure 5b: Simulated noisy, dense optic flow field for the
right camera.

Figure 5c: Result of segmentation performed using Adiv's
algorithm [1].

Figure 5d: Simulated ideal dense field of disparity vectors.
Baseline is 0.5 focal units.


size       position       Object Translation (focal units)
                          Input          Computed
2, 2, 2    -3, -1, 15     Tz = 1.00      (illegible)

Table 5: Translation in depth of sphere of radius 2. Gaussian
noise of σ = 0.3 added to the optic flow fields. Camera is

Figure 6a: Simulated noisy, dense optic flow field
for the left camera.

Figure 6b: Simulated noisy, dense optic flow
field for the right camera.

Figure 6c: Result of segmentation
performed using Adiv's algorithm [1].

object      size       position        motion
sphere      2, 2, 2    9, 9, 30        Object Translation (focal units) and Object
                                       Rotation (radians), input and computed:
                                       entries not legible
ellipsoid   2, 5, 2    -3, -1, 20      stationary
plane       Z = X + 0.5Y + 50          stationary
camera                                 Camera Translation (focal units), input:
                                       Tx = 0.50, Ty = 0.05, Tz = 1.0;
                                       computed translation and Camera Rotation
                                       (radians): entries not legible

Table 6: General camera motion with independent object motion.
Gaussian noise of σ = 0.3 added to the optic flow fields.
MID parameters for the camera are identical to those values
computed for the ideal flow fields in Experiment 1 (see Table 1),
demonstrating that the optimization step shows robust performance
when the number of relative flow vectors available for the
computation is not small, even if they are erroneous.

6.  Summary 

We have so far described how stereoscopic motion can be
used to extract the MID parameters. The chief motivation was
the need to obtain the MID parameters, particularly translation
in depth, as quickly as possible. These parameters can be obtained
from a general purpose motion algorithm such as Adiv's
[1], which computes all six motion parameters. But these
algorithms are computationally intensive and do not permit a
quick yet reliable grouping of the information from the imagery
into regions corresponding to objects moving in depth. One of
the chief reasons for this is that the computations need to deal
with nonlinear equations that relate the flow information to the
motion and structure parameters.

Use of stereoscopic motion for the computation of the MID
parameters simplifies and speeds up the computation for
the following reasons:

1. The chief advantage provided by the integration of stereo
and motion information for extracting motion in depth
is that stereo information provides additional constraints
that make it possible to formulate a linear relationship
between the data in the flow and disparity fields and the
motion in depth parameters (see equations (19, 20)). This,
in turn, makes it possible to devise a direct computation
without any hypothesizing of the motion parameters such
as is done in [1, 20].

2. The technique does not consider all possible subsets of
the segments for grouping purposes. It only considers the
neighbouring segments for merger decisions. This is
reasonable since we are searching for groups in the image that
correspond to the same three MID parameters, rather than
seeking object masks. This makes the complexity of this
step linear, rather than exponential (as in [1]).

3. The three MID parameters are the result of a direct
computation using the optimization constraint on the
relative flow field. This can be used to
directly compute the remaining three motion parameters,
again, without employing a hypothesize and test scheme.
Hence, extracting all the motion parameters becomes a
problem of handling two sets of linear functionals.

4. The algorithm can be employed on motion sequences that
include several independently moving objects and involve
the general translation and rotation of the objects and the
camera.

A problem that we hope to resolve in future work has to do
with the nature of the relative flow fields. Since they are
computed as a difference of the two flow fields, they are susceptible
to error if the flow values are small. Hence, the optimization
step is more accurate with larger motion values. Incorporating
minimization norms that take this into account may help.

In the current implementation, we use the disparity field to
provide us with the depth information in the computation. In
future work, we hope to use the current results as a startup
process and extend to a multi-frame paradigm for computing
the motion over several frames, as well as using this disparity
information as a prediction of the depth and refining the depth
map over several frames.
ABSTRACT


The fundamental assumption of almost all existing computational
theories for the perception of structure from
motion is that moving elements on the retina correspond
projectively to identifiable moving points in three-dimensional
space. Furthermore, these computational theories are based
on the fundamental idea of retinal motion, i.e. they use as
their input the velocity with which image points are moving
(optic flow or discrete displacements). In this research, we
investigate the possibility for the development of computational
theories for the perception of structure from motion
that are not based on the concept of the velocity of individual
image elements, i.e. they do not use optic flow or displacements
as input.


1.  INTRODUCTION 

The problem of structure from motion has received considerable
attention lately (Ullman, 1979; Longuet-Higgins and
Prazdny, 1980; Tsai and Huang, 1984). The problem is to
recover the three-dimensional motion and structure of a moving
object from a sequence of its images. Even though computation
of structure and 3-D motion are equivalent when the
retinal motion is given, the two problems have received
different names, the former "Structure from Motion" and the
latter "Passive Navigation". We will refer to them
interchangeably.

Basically there have been two approaches toward solving
this problem. The first assumes "small" motion. In this case,
if the three-dimensional intensity function (one temporal and
two spatial arguments) is locally well behaved and its
spatiotemporal gradients are defined, then the image velocity field
(or optic flow) may be computed (Horn and Schunck, 1981;
Hildreth, 1984). The second approach assumes that the
motion is "large" and measurement of image motion entails
solving the correspondence problem. Imaged feature points
due to the same three-dimensional artifact (e.g. texture element,
edge junction, etc.) in two successive dynamic frames
are assumed to be identified correctly. Algorithms using this
approach compute 3-D transformation parameters from the
displacements field (Tsai and Huang, 1981; Longuet-Higgins,
1981; Ullman, 1979).
The fact is that computing retinal motion (optic flow or
discrete displacements) is an ill-posed problem in the sense of
Hadamard (1923); that means that in order to compute retinal
motion we must make some assumptions about the world
that might not always hold in the imaged scene. Until
a rigorous theory of discontinuous regularization is developed,
the problem of computing retinal motion remains a very hard
one, as also acknowledged by Ullman (Ullman, 1981).
Without arguing about whether or not correspondence is a
basic process in the human visual system (and whether it should be so
in a computer vision system), we investigate the problem of
structure from motion without using correspondences. We
develop computational theories for the problem that are not
based on the concept of local retinal motion or correspondence
of tokens. In this way, we show that in principle,
correspondence or retinal motion are not necessary for computing
"structure from motion". To understand whether or not
algorithms based on correspondenceless "structure from
motion" computational theories are used by the human visual
system is beyond the scope of this paper.¹ Since our research
interests lie primarily in machine vision, we will claim only
that our correspondenceless computational theories may be
used as the basis of algorithms for the problem of machine
"passive navigation", because they are not based on the
concept of retinal motion or correspondence, whose computations
are ill-posed problems (Poggio et al., 1985).

The organization of the paper is as follows: Section 2
reviews previous work, and then we progressively build up our
computational theories, by starting with some assumptions
which are eliminated in the course of our analysis. So,
Section 3 addresses the problem in the case where the depth of
the scene is known. Section 4 solves the problem when the
shape (something less than depth) is known, and Section 5
addresses the problem in the general case, where nothing is
known about the scene. All treatments are done for perspective
projection. Finally, Section 6 concludes the work and
discusses future research.


2.  PREVIOUS  RESEARCH 

Addressing the problem of structure from motion in a
correspondenceless way is a relatively new idea. According to
our knowledge, the problem was first addressed by Aloimonos
and Brown (1981) in a simple way. It was shown in this work
that the problem could be solved in the case of pure rotation
¹However, some researchers might claim so (Jenkin and Kolers,
In the case of egomotion, when there exists only pure rotation,
the optic flow $(u, v)$ at every image point $(x, y)$ is given
by

$$u(x, y) = \omega_1 x y - \omega_2 (x^2 + 1) + \omega_3 y \tag{1}$$

$$v(x, y) = \omega_1 (y^2 + 1) - \omega_2 x y - \omega_3 x \tag{2}$$

where $\Omega = (\omega_1, \omega_2, \omega_3)$ is the rotation of the observer (see
Figure 1).

On the other hand, if $s(x, y, t)$ is the time varying image
intensity function, then the following relation is true (Horn,
1985):

$$s_x u + s_y v + s_t = 0 \tag{3}$$

where the subscripts denote partial differentiation. Substituting
(1) and (2) into (3), we get a linear equation in the unknowns
$\omega_1, \omega_2, \omega_3$ whose coefficients are functions of retinal
position and the image intensity function and its spatiotemporal
derivatives. Considering this equation at many image
points we get, through some aggregation method (least
squares; Hough transform), the unique solution. Of course in
this case structure cannot be computed, because there is no
translational motion.
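
As an illustration of the aggregation step, the following is a minimal sketch in Python with NumPy, assuming the derivatives $s_x$, $s_y$, $s_t$ have already been sampled at a set of image points. The function name `rotation_from_brightness` and its argument layout are hypothetical and not taken from the cited work; only the least-squares variant is shown.

```python
import numpy as np

def rotation_from_brightness(x, y, sx, sy, st):
    """Least-squares estimate of (w1, w2, w3) from equations (1)-(3).

    x, y       : image coordinates of the sample points (1-D arrays)
    sx, sy, st : spatial and temporal intensity derivatives at those points
    """
    # Substituting (1) and (2) into (3) gives, at each point,
    #   a1*w1 + a2*w2 + a3*w3 = -st
    a1 = sx * x * y + sy * (y**2 + 1)
    a2 = -sx * (x**2 + 1) - sy * x * y
    a3 = sx * y - sy * x
    A = np.stack([a1, a2, a3], axis=1)
    omega, *_ = np.linalg.lstsq(A, -st, rcond=None)
    return omega
```

Each image point contributes one row, so at least three points in general position are needed; in practice many more would be used to average out noise in the derivatives.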

At about the same time Kanatani independently was
devising general methods for recovering structure from motion
without correspondence for planar surfaces, using the idea of
linear features introduced in (Kanatani, 1984, 1985). At the
same time Negahdaripour and Horn independently addressed
the problem in a rigorous way and provided a uniqueness
analysis for planar surfaces (Negahdaripour and Horn, 1985).
Later, Horn and Weldon (1986) worked on the case of
non-planar surfaces and pure translation and devised a method
that searches for the focus of expansion and works under
some assumptions. They also provided a robustness analysis
for the case of pure rotation. In the meantime, Huang
independently talked about correspondenceless detection of 3-D
motion in the case of planes in a topical meeting at an Optical
Society Conference (Huang, 1985), and Amari and Maruyama
in Japan independently solved the problem for the case
of planar surfaces (Amari and Maruyama, 1986). In the
meantime Kanatani was improving his techniques for planar
surfaces and for polyhedral ones under the assumption of
knowledge of the structure (Kanatani and Chou, 1987).

This is, according to our knowledge, the sequence of
research efforts up to the present.

3.  KNOWN  DEPTH 

Here we treat the case of passive navigation when the
depth of points in the scene is known. In the differential case
the problem is trivial. Indeed, the optic flow $(u, v)$ at every
point is given by

$$u(x, y) = -\frac{U}{Z} + x\frac{W}{Z} + \omega_1 x y - \omega_2 (x^2 + 1) + \omega_3 y \tag{4}$$

$$v(x, y) = -\frac{V}{Z} + y\frac{W}{Z} + \omega_1 (y^2 + 1) - \omega_2 x y - \omega_3 x \tag{5}$$

where $(U, V, W)$ is the translation and $\Omega = (\omega_1, \omega_2, \omega_3)$ the
rotation of the camera, and $Z$ the depth of the 3-D point
whose image is the point $(x, y)$. Substituting (4) and (5) into
(3) we get a linear equation with unknowns $U, V, W, \omega_1, \omega_2$
and $\omega_3$. Considering this equation at many points we get a
solution in general, using a least squares or Hough transform
methodology.
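
The differential known-depth case can be sketched the same way. The code below is a minimal illustration, assuming that depth $Z$ and the intensity derivatives are available at each sample point; the function name `motion_from_brightness_known_depth` is hypothetical, and the least-squares route is only one of the two aggregation methods mentioned above.

```python
import numpy as np

def motion_from_brightness_known_depth(x, y, Z, sx, sy, st):
    """Least-squares estimate of (U, V, W, w1, w2, w3) from (3), (4), (5).

    x, y       : image coordinates of the sample points (1-D arrays)
    Z          : known depth at each point
    sx, sy, st : spatial and temporal intensity derivatives
    """
    # Substituting (4) and (5) into (3): one linear equation per point,
    # with right-hand side -st and the six columns below.
    columns = [
        -sx / Z,                              # coefficient of U
        -sy / Z,                              # coefficient of V
        (sx * x + sy * y) / Z,                # coefficient of W
        sx * x * y + sy * (y**2 + 1),         # coefficient of w1
        -sx * (x**2 + 1) - sy * x * y,        # coefficient of w2
        sx * y - sy * x,                      # coefficient of w3
    ]
    A = np.stack(columns, axis=1)
    params, *_ = np.linalg.lstsq(A, -st, rcond=None)
    return params
```

Because the depth is known, the translation appears with known coefficients and its magnitude is recovered as well, unlike the case of unknown structure.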

In the discrete case, the problem is more involved and
we present here its solution. We consider a set
$A = \{(X_i, Y_i, Z_i),\ i = 1, \ldots, n\}$ of 3-D points that undergoes
a rigid transformation and becomes the set
$A' = \{(X_i', Y_i', Z_i'),\ i = 1, \ldots, n\}$. We wish to recover the
transformation (that transformed set $A$ to set $A'$) without
considering any point to point correspondences. A part of the
forthcoming discussion follows (Aloimonos and Rigoutsos,
1986).

Any  rigid  motion  can  be  analyzed  into  a  rotation  plus  a 
translation;  the  rotation  axis  can  be  considered  as  passing 
through  any  point  in  space,  but  after  this  point  is  chosen, 
everything  else  is  fixed. 

If we consider the rotation axis as passing through the
origin of the coordinate system, then if the point
$(X_i, Y_i, Z_i) \in A$ moves to a new position $(X_i', Y_i', Z_i') \in A'$, the
following relation holds:

$$(X_i', Y_i', Z_i')^T = R\,(X_i, Y_i, Z_i)^T + T, \qquad i = 1, 2, 3, \ldots, n \tag{6}$$

where $R$ is the $3 \times 3$ rotation matrix and
$T = (\Delta X, \Delta Y, \Delta Z)^T$ is the translation vector. We wish to
recover the parameters $R$ and $T$, without using any
point-to-point correspondences.

Let

$$(X_i, Y_i, Z_i)^T \equiv P_i \quad\text{and}\quad (X_i', Y_i', Z_i')^T \equiv P_i', \qquad i = 1, 2, 3, \ldots, n .$$

Then, equation (6) becomes

$$P_i' = R P_i + T, \qquad i = 1, 2, 3, \ldots, n .$$

Summing up the above $n$ equations and dividing by the total
number of points, $n$, we get

$$\frac{\sum_i P_i'}{n} = R\,\frac{\sum_i P_i}{n} + T \tag{7}$$

From  (7)  it  is  clear  that  if  the  rotation  matrix  R  is  known, 
then  the  translation  vector  T  can  be  computed.  So,  in  the 
sequel,  we  will  describe  how  to  recover  the  rotation  matrix  R. 
In  order  to  get  rid  of  the  translational  part  of  the  motion  we 
shall  transform  the  3-D  points  to  “free”  vectors  by  subtract¬ 
ing  the  center-of-mass  vector. 

Let, therefore, $CM_A$ and $CM_{A'}$ be the center-of-mass
vectors of the sets of points $A$ and $A'$ respectively; i.e.
$CM_A = \sum_i (P_i / n)$ and $CM_{A'} = \sum_i (P_i' / n)$. We furthermore
define

$$v_i = P_i - CM_A, \qquad i = 1, 2, 3, \ldots, n$$

$$v_i' = P_i' - CM_{A'}, \qquad i = 1, 2, 3, \ldots, n$$

With these definitions, the motion equation (6) becomes

$$v_i' = R v_i, \qquad i = 1, 2, 3, \ldots, n$$

where $R$ is the (orthogonal) rotation matrix.

If we know the correspondences of some points (at least
three) then the matrix $R$ can in principle be recovered, and
such efforts have been published. But we would like to
recover matrix $R$ without using any point correspondences.
Let

$$v_i = (v_{x_i}, v_{y_i}, v_{z_i}), \qquad i = 1, 2, 3, \ldots, n$$

$$v_i' = (v_{x_i}', v_{y_i}', v_{z_i}'), \qquad i = 1, 2, 3, \ldots, n$$

Note that $v_i$ and $v_i'$ are the position vectors of the members of
sets $A$ and $A'$ respectively with respect to their center-of-mass
coordinate systems.

We wish to find a quantity that will uniquely characterize
the sets $A$ and $A'$ in terms of their "relationship" (rigid
motion transformation). We have found that the matrix
consisting of the second order moments of the vectors $v_i$ and $v_i'$
has these properties. In particular, let

$$V = \sum_{i=1}^{n} (v_{x_i}, v_{y_i}, v_{z_i})^T (v_{x_i}, v_{y_i}, v_{z_i}) \quad\text{and}\quad V' = \sum_{i=1}^{n} (v_{x_i}', v_{y_i}', v_{z_i}')^T (v_{x_i}', v_{y_i}', v_{z_i}') .$$

From these relations, we have

$$V' = \sum_{i=1}^{n} (v_{x_i}', v_{y_i}', v_{z_i}')^T (v_{x_i}', v_{y_i}', v_{z_i}')
     = \sum_{i=1}^{n} R\,(v_{x_i}, v_{y_i}, v_{z_i})^T (v_{x_i}, v_{y_i}, v_{z_i})\,R^T = R V R^T .$$

So,

$$V' = R V R^T \tag{8}$$

At this point it should be mentioned that equation (8)
represents an invariance between the two sets of 3-D points $A$
and $A'$, since the matrices $V$ and $V'$ are similar. In other
words, the eigenvalues of matrix $V$ remain invariant
under rigid motion transformation. From now on, the
recovery of the rotation matrix $R$ is simple and comes from
basic linear algebra. Furthermore, equation (8) implies that
the matrices $V$ and $V'$ have the same set of eigenvalues
(Stewart, 1980).

But since $V$ and $V'$ are symmetric matrices, they can be
expanded in their eigenvalue decomposition, i.e. there exist
matrices $S$, $T$ such that

$$V = S D S^T \tag{9}$$

$$V' = T D T^T \tag{10}$$

where $S$, $T$ are orthogonal matrices having as columns the
eigenvectors of the matrices $V$ and $V'$ respectively (e.g. the
$i$-th column corresponding to the $i$-th eigenvalue) and $D$ is a
diagonal matrix consisting of the eigenvalues of the matrices
$V$ and $V'$. We have to mention at this point that in order to
make the decomposition unique we require that the eigenvectors
in the columns of matrices $S$ and $T$ be orthonormal.

From equations (8), (9) and (10) we derive that matrices
$T$ and $RS$ both consist of the orthonormal eigenvectors of
matrix $V'$. In other words, the columns of matrices $RS$ and
$T$ must be the same, with a possible change of sign. So, the
matrix $RS$ is equal to one of eight possible matrices
$T_i$, $i = 1, \ldots, 8$. Thus, $R = T_i S^T$, $i = 1, \ldots, 8$. But the
rotation matrix is orthogonal and it has determinant equal to
one. Furthermore, if we apply matrix $R$ to the set of vectors
$v_i$ then we should get the set of vectors $v_i'$. So, given the
above three conditions and Chasles' theorem, the matrix $R$
can be computed uniquely. Thus, we can compute 3-D
motion without point correspondences. The robustness
against noise in this method (and noise arises from several
sources) is studied elsewhere. Here, we wanted to
theoretically demonstrate the feasibility of the approach.
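
The procedure of this section can be sketched as follows. This is a minimal illustration in Python with NumPy under stated assumptions: the two point sets are noise-free, the eigenvalues of $V$ are distinct, and the "same set" test is done with a brute-force nearest-neighbour check. The function name `rigid_motion_without_correspondence` and the tolerance `tol` are hypothetical choices of this sketch, not part of the original method description.

```python
import itertools
import numpy as np

def rigid_motion_without_correspondence(A, A_prime, tol=1e-6):
    """Recover (R, T) with A' = R A + T from unordered (n, 3) point sets."""
    cm, cm_p = A.mean(axis=0), A_prime.mean(axis=0)   # centers of mass
    v, v_p = A - cm, A_prime - cm_p                   # "free" vectors
    V, V_p = v.T @ v, v_p.T @ v_p                     # second-order moments
    _, S = np.linalg.eigh(V)      # columns: orthonormal eigenvectors of V
    _, T = np.linalg.eigh(V_p)    # same (ascending) eigenvalue ordering
    for signs in itertools.product([1.0, -1.0], repeat=3):
        R = (T * np.array(signs)) @ S.T   # one of the eight candidates T_i S^T
        if np.linalg.det(R) < 0:
            continue                      # determinant must be +1
        moved = v @ R.T                   # rows are R v_i
        # distance from every rotated vector to its nearest vector in v_p
        d = np.linalg.norm(moved[:, None, :] - v_p[None, :, :], axis=2).min(axis=1)
        if d.max() < tol:
            return R, cm_p - R @ cm       # translation from equation (7)
    raise ValueError("no candidate rotation maps one set onto the other")
```

The order of the rows in `A_prime` is irrelevant, which is the point of the method: only the centers of mass and the moment matrices enter the computation of the candidates, and the point sets are used again only to select among them.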


4. KNOWN SHAPE

Here we treat the egomotion problem when the shape of
the object (or scene) in view is known. The formulation and
treatment here follow Ito and Aloimonos (1987). Assume that
a surface $W$, whose shape is known, is moving rigidly while
its images are taken (Figure 2).

Figure 2.

Let the motion of the surface consist of a translation
$T = (t_1, t_2, t_3)$ and a rotation $\Omega = (\omega_1, \omega_2, \omega_3)$. So, if
$\mathbf{X} = (X, Y, Z)$ is a point on the surface $W$, then its velocity is
$V(\mathbf{X}) = (V_1, V_2, V_3) = T + \Omega \times \mathbf{X}$. This velocity can be
written as $V(\mathbf{X}) = \sum_{k=1}^{6} m_k V_k(\mathbf{X})$, where $m_k$ are the motion
parameters and $V_k$ are fixed vector fields. Indeed

$$\begin{aligned}
m_1 &= t_1, & V_1(\mathbf{X}) &= (1\ \ 0\ \ 0)^T \\
m_2 &= t_2, & V_2(\mathbf{X}) &= (0\ \ 1\ \ 0)^T \\
m_3 &= t_3, & V_3(\mathbf{X}) &= (0\ \ 0\ \ 1)^T \\
m_4 &= \omega_1, & V_4(\mathbf{X}) &= (0\ \ {-Z}\ \ Y)^T \\
m_5 &= \omega_2, & V_5(\mathbf{X}) &= (Z\ \ 0\ \ {-X})^T \\
m_6 &= \omega_3, & V_6(\mathbf{X}) &= ({-Y}\ \ X\ \ 0)^T
\end{aligned}$$

Let $P$ denote the perspective projection (from the object surface
to the image plane), i.e. if $\mathbf{x} = (x, y)$ is the image of
point $\mathbf{X} = (X, Y, Z) \in W$, then

$$P(\mathbf{X}) = (P_1(\mathbf{X}), P_2(\mathbf{X})) = \left(\frac{fX}{Z}, \frac{fY}{Z}\right) = (x, y),$$

where $f$ is the focal length of the camera, assumed here to be
unity. Since the shape of the object in view is assumed to be
known (let it be the function $A$), a mapping $P_A^{-1}$: (image
plane) → (object surface) is defined such that
$P(P_A^{-1}(\mathbf{x})) = \mathbf{x}$. The optic flow at point $\mathbf{x}$ is given by

$$\dot{\mathbf{x}} = (u, v) = \frac{d}{dt} P(\mathbf{X}) = \frac{\partial P}{\partial \mathbf{X}}\,V(\mathbf{X}) .$$

Clearly $\partial P / \partial \mathbf{X}$ and $V(\mathbf{X})$ are functions of $\mathbf{X}$, but using the
inverse projection (from the image to the surface), we can
consider them as functions of $\mathbf{x}$ (retinal coordinates).
So, $\partial P / \partial \mathbf{X} = \frac{\partial P}{\partial \mathbf{X}}(P_A^{-1}(\mathbf{x}))$ and $V = V(P_A^{-1}(\mathbf{x}))$;
since $V = \sum_k m_k V_k$, we have

$$\dot{\mathbf{x}} = \sum_{k=1}^{6} m_k \Phi_k(\mathbf{x}), \qquad
\Phi_k(\mathbf{x}) = \frac{\partial P}{\partial \mathbf{X}}(P_A^{-1}(\mathbf{x}))\; V_k(P_A^{-1}(\mathbf{x})) .$$

It has to be emphasized that $\Phi_k$ are functions of the shape,
and so are known.


4.1. Linear features

A linear feature $l$ is a linear functional over the image
$S(x, y)$, i.e.

$$l = \iint_D S(x, y)\,\mu(x, y)\,dx\,dy$$

where $\mu(x, y)$ is a "measuring" analytic function, and the
integration is taken over the area $D$ of interest. A linear
feature vector $\mathbf{l}$ is a vector whose components are linear
features (Amari, 19B8), i.e.

$$\mathbf{l} = [l_1, l_2, \ldots, l_n]^T \quad\text{where}\quad l_i = \iint_D S\,\mu_i\,dx\,dy$$

where $\mu_i$ is a measuring function for $i = 1, 2, 3, \ldots, n$. The
functions $\mu_i$ could be any set of "nice" functions, such as
moments ($x$, $y$, $xy$, $x^2 y$, etc.), or $e^{i(px + qy)}$, in which case a
linear feature corresponds to a Fourier component of the
image.

A  linear  feature  is  a  global  image  feature,  i.e.  a  global 
characteristic  of  the  whole  image.  Using  different  measuring 
functions  we  get  different  global  measurements. 
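
A linear feature vector is straightforward to approximate on a sampled image. The sketch below, in Python with NumPy, replaces the double integral by a Riemann sum on a regular grid; the function name `linear_features`, the grid arguments and the particular list `moments` are assumptions made for this illustration.

```python
import numpy as np

def linear_features(S, xs, ys, measuring_functions):
    """Approximate l_i = integral of S(x, y) * mu_i(x, y) over a regular grid.

    S      : 2-D array with S[r, c] sampled at (xs[c], ys[r])
    xs, ys : equally spaced sample coordinates along x and y
    """
    X, Y = np.meshgrid(xs, ys)
    dA = (xs[1] - xs[0]) * (ys[1] - ys[0])      # area of one grid cell
    return np.array([np.sum(S * mu(X, Y)) * dA for mu in measuring_functions])

# A few low-order moments used as measuring functions.
moments = [
    lambda x, y: np.ones_like(x),
    lambda x, y: x,
    lambda x, y: y,
    lambda x, y: x * y,
]
```

Each entry of the result is a single number summarizing the whole image, which is what makes the feature global; choosing a different list of measuring functions changes what is being measured but not the computation.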


4.2.  The  constraint 

Since there is motion, the induced optic flow field
satisfies the following equation (Horn, 1985)

$$S_x u + S_y v + S_t = 0 \tag{12}$$

where $(u, v)$ is the flow at a point $(x, y)$ and $S_x$, $S_y$, $S_t$ are
the spatiotemporal derivatives of the image intensity function
at that point.

Equation (12) can be written as

$$\frac{\partial S}{\partial t} = -\dot{\mathbf{x}} \cdot \nabla S .$$

The time derivative of a linear feature vector $\mathbf{l}$ will be
$\dot{\mathbf{l}} = [\dot{l}_1, \dot{l}_2, \ldots, \dot{l}_n]^T$, where

$$\dot{l}_i = \iint_D \mu_i \frac{\partial S}{\partial t}\,dx\,dy = -\iint_D \mu_i\,(\dot{\mathbf{x}} \cdot \nabla S)\,dx\,dy .$$

The optic flow field from equation (11) can be written as

$$\dot{\mathbf{x}} = \sum_{k=1}^{6} m_k \Phi_k = \sum_{k=1}^{6} m_k \begin{bmatrix} \Phi_{1k}(\mathbf{x}) \\ \Phi_{2k}(\mathbf{x}) \end{bmatrix}$$

where $\Phi_{1k}$, $\Phi_{2k}$ are the components of $\Phi_k$.
On the other hand,

$$\dot{l}_i = -\iint_D \mu_i\,(\dot{\mathbf{x}} \cdot \nabla S)\,dx\,dy
           = -\sum_{k=1}^{6} m_k \iint_D \mu_i\,(\Phi_{1k} S_x + \Phi_{2k} S_y)\,dx\,dy$$

or $\dot{l}_i = \sum_{k=1}^{6} m_k h_{ik}$ with

$$h_{ik} = -\iint_D \mu_i\,(\Phi_{1k} S_x + \Phi_{2k} S_y)\,dx\,dy .$$

In other words,

$$\dot{\mathbf{l}} = H \mathbf{m}, \tag{13}$$

where $H = (h_{ik})$ and $\mathbf{m} = (m_1, m_2, \ldots, m_6)^T$ are the motion
parameters.

Equation (13) relates the unknown motion parameters to
the measurable $\dot{\mathbf{l}}$ and the entries of matrix $H$, which involve
the shape parameters, which are assumed to be known.

Of course there is a degeneracy, and we cannot solve for
all the motion parameters (only the direction of translation
and the rotation).

A simple least squares method can solve for the parameters
$m_k$.

To give an example, let's consider the surface in view to
be planar, with equation $Z = pX + qY + c$ with respect to
the camera coordinate system. In that case we know $p$ and
$q$. The parameters $m_k$ are $m_1 = t_1/c$, $m_2 = t_2/c$, $m_3 = t_3/c$,
$m_4 = \omega_1$, $m_5 = \omega_2$, and $m_6 = \omega_3$, and the functions $\Phi_k$ are

$$\begin{aligned}
\Phi_1(\mathbf{x}) &= (1 - px - qy,\ 0) \\
\Phi_2(\mathbf{x}) &= (0,\ 1 - px - qy) \\
\Phi_3(\mathbf{x}) &= (-x(1 - px - qy),\ -y(1 - px - qy)) \\
\Phi_4(\mathbf{x}) &= (xy,\ y^2 + 1) \\
\Phi_5(\mathbf{x}) &= (-(x^2 + 1),\ -xy) \\
\Phi_6(\mathbf{x}) &= (y,\ -x),
\end{aligned}$$

where the plane is moving with translation $T = (t_1, t_2, t_3)$
and rotation $\Omega = (\omega_1, \omega_2, \omega_3)$.


4.3. Summary

We have presented a method for the correspondenceless
recovery of motion, given that the shape is known. The
algorithm can be summarized as follows:

1. The input is a sequence of images $S(x, y, t)$ and
$S(x, y, t + dt)$.

2. Choose a set of measuring functions $\mu_i(x, y)$.

3. Create a linear feature vector $\mathbf{l} = [l_1, l_2, \ldots, l_n]^T$ where
$l_i = \iint S(x, y)\,\mu_i(x, y)\,dx\,dy$.

4. Compute the change of $\mathbf{l}$, i.e. the time derivative $\dot{\mathbf{l}}$, from
the images $S(x, y, t)$ and $S(x, y, t + dt)$.

5. Compute the entries $h_{ik}$ of matrix $H$. For example, in the
planar case they follow from the functions $\Phi_k$ of the plane
given in Section 4.2.

6. From equation $\dot{\mathbf{l}} = H \mathbf{m}$, solve for $\mathbf{m}$ using a
least-squares methodology.
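
The steps above can be sketched for the simplest input, a dot pattern on a plane $Z = pX + qY + c$ with known $p$ and $q$. The Python/NumPy code below is an illustration under explicit assumptions: for a dot pattern the features are taken as $l_i = \sum_j \mu_i(x_j, y_j)$, and differentiating this sum directly by the chain rule gives $h_{ik} = \sum_j \nabla\mu_i(x_j, y_j) \cdot \Phi_k(x_j, y_j)$. That expression is derived for this sketch and is not quoted from the text; the names `planar_phi`, `features`, `feature_matrix`, `solve_motion` and the monomial list `EXPONENTS` are hypothetical.

```python
import numpy as np

# Measuring functions mu = x**a * y**b, given by their exponents (a, b).
EXPONENTS = [(1, 0), (0, 1), (2, 0), (1, 1), (0, 2),
             (3, 0), (2, 1), (1, 2), (0, 3)]

def planar_phi(x, y, p, q):
    """The six flow fields Phi_k of the planar case; shape (6, 2, n)."""
    s = 1.0 - p * x - q * y
    zero = np.zeros_like(x)
    return np.array([
        [s, zero],
        [zero, s],
        [-x * s, -y * s],
        [x * y, y**2 + 1],
        [-(x**2 + 1), -x * y],
        [y, -x],
    ])

def features(x, y):
    """Linear feature vector of a dot pattern: l_i = sum_j mu_i(x_j, y_j)."""
    return np.array([np.sum(x**a * y**b) for a, b in EXPONENTS])

def feature_matrix(x, y, p, q):
    """H with h_ik = sum_j grad(mu_i)(x_j) . Phi_k(x_j)  (dot-pattern case)."""
    phi = planar_phi(x, y, p, q)
    H = np.empty((len(EXPONENTS), 6))
    for i, (a, b) in enumerate(EXPONENTS):
        gx = a * x**max(a - 1, 0) * y**b      # d(mu_i)/dx
        gy = b * x**a * y**max(b - 1, 0)      # d(mu_i)/dy
        H[i] = (phi[:, 0] * gx + phi[:, 1] * gy).sum(axis=1)
    return H

def solve_motion(l_dot, H):
    """Least-squares solution m of l_dot = H m (step 6)."""
    m, *_ = np.linalg.lstsq(H, l_dot, rcond=None)
    return m
```

In use, `l_dot` would be estimated as `(features(x2, y2) - features(x1, y1)) / dt` from the dot positions in two frames; no pairing of dots between the frames is needed, because `features` is a sum over all dots.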

Finally, we wish to stress here the fact that in this algorithm
the spatial derivatives of the image intensity function don't
need to be computed.² This is due to the fact that we use
linear features (integration by parts avoids differentiation of
the intensity function; instead, the derivative of the measuring
function $\mu_i$ needs to be computed, and this is well
defined). So we don't have to do numerical differentiation,
which is an ill-posed problem (Poggio et al., 1986). More
importantly, the same approach can be followed when the
image intensity function is discontinuous. To further clarify,
consider the image to be a dot pattern, i.e. consisting of the
points $(x_i, y_i)$, $i = 1, 2, 3, \ldots, n$. Then the moving image is

$$S(x, y, t) = \sum_{i=1}^{n} \delta\big(x - x_i(t),\ y - y_i(t)\big) .$$

In that case a linear feature is trivially computable. So, the
algorithm we have presented works for nondifferentiable image
functions as well.
We find the concept of linear features (introduced by Amari
and communicated to vision researchers by Kanatani) to be
very powerful in cybernetics research.


5.  THE  GENERAL  CASE:  TOWARDS  A  THEORY 
OF  OPTIC  VOLUME  VISION 

Here  we  treat  the  general  case,  i.e.  the  case  where  an 
object  is  moving  rigidly  and  we  wish  to  compute  its  structure 
and  motion  without  correspondence,  using  only  one  camera. 
Our  input  will  be  the  contours  in  the  object  (any  kind  of  con¬ 
tour).  In  other  words,  the  object  is  considered  to  consist  of 
curved  or  straight  (non)planar  lines.  In  the  case  of  egomotion 
such  contours  can  be  the  zero-crossings  in  the  image,  whose 
computation  can  be  considered  robust  enough,  according  to 
recent  literature  (Poggio,  1986).  Our  input  will  be  the  time 
varying  image  of  the  contours,  i.e.  a  surface  in  3-D  space 
which  is  the  Cartesian  product  of  the  image  (2-D)  and  time 
(1-D).  In  other  words,  assuming  that  we  can  continually 
accumulate  images  and  we  put  one  after  the  other,  then  the 
image  contours  will  trace  a  set  of  surfaces.  By  examining 
these  surfaces,  we  can  determine  constraints  among  motion, 
shape  and  the  curvature  of  the  surface.  We  do  not  use  any 
intermediate  representations  such  as  optic  flow  or  correspon¬ 
dence,  and  in  this  way  we  avoid  the  aperture  problem. 


5.1.  Mathematical  formulation 

Keeping  our  previous  nomenclature  (3-D  coordinates 
denoted  by  capital  letters  and  retinal  coordinates  by  small 
ones),  we  proceed  to  mathematically  formulate  our  problem. 

A line $L$ in 3-D space can be parameterized in one
parameter, i.e.

$$L:\quad \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} X(\sigma) \\ Y(\sigma) \\ Z(\sigma) \end{bmatrix}
\quad\text{or}\quad \mathbf{X} = \mathbf{X}(\sigma)$$

under the assumption that this function is differentiable as
many times as we wish. But since we are considering a collection
of lines, this function has finitely many points of
discontinuity; this does not matter here, since we will consider only
differentiable points.

The line $L$ is moving rigidly. So, $L$ and $\mathbf{X}$ are functions
of time $\tau$, i.e. $L = L(\tau)$ and $\mathbf{X} = \mathbf{X}(\sigma, \tau)$.


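Let $\omega = (\omega_1, \omega_2, \omega_3)^T$ be the rotational velocity around
the $X$, $Y$ and $Z$ axes (see Figure 1) and $T = (t_1, t_2, t_3)^T$ the
translational velocity. Of course the motion is not constant,
i.e. $\omega$ and $T$ are functions of time: $\omega = \omega(\tau)$ and $T = T(\tau)$.

²However, we need to differentiate in the time dimension.

The velocity at point $\mathbf{X}$ of the line $L$ is

$$\dot{\mathbf{X}} = \frac{\partial \mathbf{X}}{\partial \tau} = \Omega \mathbf{X} + T, \quad\text{where}\quad
\Omega = \begin{bmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{bmatrix} .$$

The acceleration at point $\mathbf{X}$ is

$$\ddot{\mathbf{X}} = \frac{\partial^2 \mathbf{X}}{\partial \tau^2} = \frac{\partial}{\partial \tau}(\Omega \mathbf{X} + T)
= \dot{\Omega}\mathbf{X} + \Omega\dot{\mathbf{X}} + \dot{T} = \Omega(\Omega \mathbf{X} + T) + \dot{\Omega}\mathbf{X} + \dot{T} .$$

But, assuming inertial motion (i.e. $\dot{\omega} = \dot{T} = 0$), we have

$$\ddot{\mathbf{X}} = \Omega \dot{\mathbf{X}} .$$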

Using perspective projection $P$, we find that the projection of
point $\mathbf{X} = (X, Y, Z)$ is the point $\mathbf{x} = (x, y)$, with

$$P(\mathbf{X}) = \mathbf{x} = (x, y) = \left(\frac{X}{Z}, \frac{Y}{Z}\right)$$

where we have assumed the focal length to be unity. The
image of a point $\mathbf{X}(\sigma, \tau)$ on the line $L$ is a point $\mathbf{x}(\sigma, \tau)$, i.e.
$P(\mathbf{X}(\sigma, \tau)) = \mathbf{x}(\sigma, \tau)$. The image of the line $L(\tau)$ will be
denoted by $l(\tau)$, a line on the image plane ($xy$-plane).

So far we haven't specified how $\sigma$ should be taken; we
consider it to be the normal coordinate of the line $l(\tau_0)$, i.e. $\sigma$
is the length from point $\mathbf{x}(0, \tau_0)$ to point $\mathbf{x}(\sigma, \tau_0)$ along the
line $l(\tau_0)$.

The shape of a line (say at time $\tau_0$) is the shape of the
functions $X(\sigma)$, $Y(\sigma)$ and $Z(\sigma)$, of which the shape of $Z(\sigma)$ is
sufficient. (Indeed, since $X = xZ$, $Y = yZ$ with $x$, $y$ observable
from the image, the shape of $Z(\sigma)$ is sufficient.)

So, shape reconstruction means determination of the
function $Z(\sigma)$. Our purpose is to obtain $\omega$, $T$ and the function
$Z(\sigma)$. From now on, we may denote $(x, y)$ as $\mathbf{x}$ or $x^i$
and $(\sigma, \tau)$ as $\boldsymbol{\sigma}$ or $\sigma^i$, where the superscripts move from 1 to 2
and $x^1$, $x^2$, $\sigma^1$ and $\sigma^2$ mean $x$, $y$, $\sigma$ and $\tau$ respectively.

In the rest of this section we develop some relations that will be of use later. First, let us consider the derivatives of $\mathbf{x}$ at time $\tau_0$. Directly from the definition of $\sigma$, $\partial\mathbf{x}/\partial\sigma$ is a unit vector along the line. On the other hand, $\partial^2\mathbf{x}/\partial\sigma^2$ is a vector perpendicular to the line and its norm represents the curvature of the line. Also, $\partial\mathbf{x}/\partial\tau$ is the optic flow


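$$\frac{\partial\mathbf{x}}{\partial\tau} = \frac{\partial}{\partial\tau}P(\mathbf{X}(\sigma, \tau)) = \frac{\partial P}{\partial\mathbf{X}}\frac{\partial\mathbf{X}}{\partial\tau} = \frac{\partial P}{\partial\mathbf{X}}(\Omega\mathbf{X} + \mathbf{T})$$

with

$$\frac{\partial P}{\partial\mathbf{X}} = \begin{bmatrix} \dfrac{\partial P_1}{\partial X} & \dfrac{\partial P_1}{\partial Y} & \dfrac{\partial P_1}{\partial Z} \\[2mm] \dfrac{\partial P_2}{\partial X} & \dfrac{\partial P_2}{\partial Y} & \dfrac{\partial P_2}{\partial Z} \end{bmatrix},$$

where $P(\mathbf{X}) = (P_1(\mathbf{X}), P_2(\mathbf{X})) = (x, y)$. We have

$$\frac{\partial P}{\partial\mathbf{X}} = \begin{bmatrix} \dfrac{1}{Z} & 0 & -\dfrac{X}{Z^2} \\[2mm] 0 & \dfrac{1}{Z} & -\dfrac{Y}{Z^2} \end{bmatrix} = \frac{1}{Z}\begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \end{bmatrix}.$$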


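Also,

$$\frac{\partial\mathbf{x}}{\partial\tau} = \frac{1}{Z}\begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \end{bmatrix}(\Omega\mathbf{X} + \mathbf{T}) = \begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \end{bmatrix}\left(\frac{\mathbf{T}}{Z} + \Omega\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\right).$$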
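The optic flow relation above can be evaluated numerically. The following is a minimal sketch, not part of the original analysis: it assumes that $\Omega$ acts on a vector as the cross product with $\omega$ (the usual convention for a skew-symmetric rotation matrix), and the function names `cross` and `optic_flow` are hypothetical.

```python
def cross(a, b):
    """Cross product a x b of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def optic_flow(x, y, depth, omega, trans):
    """Image velocity (dx/dtau, dy/dtau) of a point imaged at (x, y) with depth Z.

    Evaluates [[1, 0, -x], [0, 1, -y]] (T/Z + Omega b) with b = (x, y, 1).
    Assumption: Omega b is computed as the cross product omega x b.
    """
    b = [x, y, 1.0]
    rot = cross(omega, b)
    # u = T/Z + Omega b, i.e. the 3-D velocity divided by the depth Z
    u = [trans[i] / depth + rot[i] for i in range(3)]
    # Multiply by the 2x3 matrix [[1, 0, -x], [0, 1, -y]]
    return (u[0] - x * u[2], u[1] - y * u[2])
```

For a pure translation along the X axis the flow at the image center is horizontal and inversely proportional to depth, while a pure rotation about the optical axis produces a flow that does not depend on depth at all.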

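On the other hand, we have

$$\frac{\partial^2\mathbf{x}}{\partial\tau\,\partial\sigma} = \frac{\partial}{\partial\sigma}\left\{\begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \end{bmatrix}\left(\frac{\mathbf{T}}{Z} + \Omega\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\right)\right\}$$

$$= \begin{bmatrix} 0 & 0 & -\dfrac{\partial x}{\partial\sigma} \\[2mm] 0 & 0 & -\dfrac{\partial y}{\partial\sigma} \end{bmatrix}\left(\frac{\mathbf{T}}{Z} + \Omega\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\right) + \begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \end{bmatrix}\left(-\frac{\mathbf{T}}{Z^2}\frac{\partial Z}{\partial\sigma} + \Omega\begin{bmatrix} \dfrac{\partial x}{\partial\sigma} \\[2mm] \dfrac{\partial y}{\partial\sigma} \\[2mm] 0 \end{bmatrix}\right).$$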

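Furthermore,

$$\frac{\partial^2\mathbf{x}}{\partial\tau^2} = \frac{\partial}{\partial\tau}\left[\frac{\partial P}{\partial\mathbf{X}}\dot{\mathbf{X}}\right] = \frac{\partial}{\partial\tau}\left[\frac{\partial P}{\partial\mathbf{X}}\right]\dot{\mathbf{X}} + \frac{\partial P}{\partial\mathbf{X}}\ddot{\mathbf{X}}.$$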


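The first term of the above is further simplified to

$$\frac{\partial}{\partial\tau}\left[\frac{\partial P}{\partial\mathbf{X}}\right]\dot{\mathbf{X}} = \begin{bmatrix} -\dfrac{\dot{Z}}{Z^2} & 0 & -\dfrac{\dot{X}}{Z^2} + \dfrac{2X\dot{Z}}{Z^3} \\[2mm] 0 & -\dfrac{\dot{Z}}{Z^2} & -\dfrac{\dot{Y}}{Z^2} + \dfrac{2Y\dot{Z}}{Z^3} \end{bmatrix}\dot{\mathbf{X}} = \frac{1}{Z^2}\begin{bmatrix} -2\dot{X}\dot{Z} + \dfrac{2X\dot{Z}^2}{Z} \\[2mm] -2\dot{Y}\dot{Z} + \dfrac{2Y\dot{Z}^2}{Z} \end{bmatrix}$$

$$= -\frac{2\dot{Z}}{Z^2}\begin{bmatrix} \dot{X} - x\dot{Z} \\ \dot{Y} - y\dot{Z} \end{bmatrix} = -\frac{2\dot{Z}}{Z}\,\frac{1}{Z}\begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \end{bmatrix}\dot{\mathbf{X}} = -\frac{2\dot{Z}}{Z}\frac{\partial\mathbf{x}}{\partial\tau}.$$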


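But

$$\frac{\dot{Z}}{Z} = -\omega_2 x + \omega_1 y + \frac{T_3}{Z}.$$

In order to improve readability, define matrix A and vector b as

$$A = \begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \end{bmatrix}, \qquad \mathbf{b} = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.$$

These are dependent only on $\mathbf{x}$. Then

$$\frac{\dot{\mathbf{X}}}{Z} = \frac{\mathbf{T}}{Z} + \Omega\mathbf{b},$$

$$\dot{\mathbf{x}} = \frac{\partial\mathbf{x}}{\partial\tau} = A\frac{\dot{\mathbf{X}}}{Z},$$

$$\ddot{\mathbf{x}} = \frac{\partial^2\mathbf{x}}{\partial\tau^2} = -\frac{2\dot{Z}}{Z}\dot{\mathbf{x}} + A\Omega\frac{\dot{\mathbf{X}}}{Z}.$$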


5.2.  xyτ space: optic volume


This section introduces the idea of the xyτ space (first originated in Bolles et al., 1987). Suppose that we have taken many images of the object (line) at very short intervals and arranged them as in Figure 3. If we consider the limit where the time intervals go to zero, we will come to the conception of a 3-D space having x, y and τ axes (see Figure 4). In this xyτ space, a moving (1-D) line is expressed by a 2-D surface (the surface that the line traces as it moves) (Figure 4). It is this surface that we will use as our input. This surface, we assume for the rest of the exposition, is directly observable.


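From now on, we will assume that the optic flow $\partial\mathbf{x}/\partial\tau$ is not in the same direction as that of the line at that point; we will only consider such points.

This condition means that the Jacobian is not equal to zero, i.e.

$$\left|\frac{\partial\mathbf{x}}{\partial\boldsymbol{\sigma}}\right| \neq 0,$$

that is, there exists an inverse function from $\mathbf{x}$ to $\boldsymbol{\sigma}$ in the neighborhood of $\mathbf{x}$. Note that the parameters $\partial\tau/\partial x^i$ and $\partial^2\tau/\partial x^i\,\partial x^j$ are observable in the surface, while $\partial\sigma/\partial x^i$ and $\partial^2\sigma/\partial x^i\,\partial x^j$ are not.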

In the xyτ space, a trajectory of a point is expressed as a line. The above-mentioned surface, i.e. the trajectory of a line, consists of such point-trajectories. However, each point trajectory is not identifiable in the surface; they are mixed up with each other. This is what we call the aperture problem. The mixing of the trajectories leads to an inconvenience (since it is the aperture problem). However, we will show in the next section that the derivatives of the xyτ surface are directly related to the 3-D motion parameters. So, by just examining the shape of the surface (xyτ) no aperture problem arises, since we treat the xyτ surface as a "surface" and not as a collection of lines. It is this xyτ surface that we call the "optic volume", since it contains all the information about 3-D motion and structure.


5.3.  The  constraint 


Clearly, the quantities $\partial\tau/\partial x^i$ and $\partial^2\tau/\partial x^i\,\partial x^j$ are observable. On the other hand, consider


and 


dx' 

do 


dr 

dr 


dr_  0x_ 
dx  Or 


dr  dy_ 
dy  dr 


(^T)5/;  («)  +  (1  -  *,Tn  b, )  (4>,tT)s  g,  (fi)  T  + 

(1  -4>,rnb,)2Tr  F,  T  =0 

Clearly,  the  ability  to  recover  3-D  motion  without 
correspondence  (for  a  motion  with  zero  acceleration)  depends 
on  whether  or  not  we  can  solve  the  above  nonlinear  system. 
This  is  possible  in  some  cases,  but  in  general,  since  iterative 
methods  are  used,  the  result  of  the  convergence  might  be  far 
from  the  solution.  A  closed  form  solution  would  be  very 
much  desirable.  On  the  other  hand,  a  five-dimensional  Hough 
transform  (Ballard,  1985)  can  solve  the  problem,  given  that  a 
lot  of  space  and  computing  power  are  available. 

6.  CONCLUSIONS  AND  FUTURE  DIRECTIONS 

We have presented a mathematical analysis that indicates that the problem of passive navigation might be solvable without correspondence. This is true in the case where prior information such as depth or shape is known. In the general case, we have outlined a constraint, and the solution comes from solving a nonlinear system. Even though solving such a system is not impossible, it would be desirable to obtain a closed form solution that will enable us to quickly solve for the motion parameters and develop a strict uniqueness analysis of the problem. We consider this problem to be an important future research goal. As to the question "is correspondence necessary for the perception of structure from motion", the answer is "no" in principle, but more research is required in order to establish a closed form solution and a uniqueness analysis. After this is done, the next important question is to discover which approach is more robust: the correspondence-based or the correspondenceless one. Finally, at this point we would like to refer to work by Todd (1985), where it is demonstrated through psychological experiments that the ability of humans to perceive structure from motion is much more general than would be reasonable to expect on the basis of correspondence-based theories, and it is suggested that the computational theories will have to be modified, if they are to account for the high level of generality exhibited by human observers.
ABSTRACT

This  paper  considers  the  problem  of  the  recovery  of  motion 
parameters  of  a  rigid  object  moving  through  an  environment  with 
constant  but  arbitrary  linear  and  angular  velocities.  The  method 
uses  temporal  information  from  a  sequence  of  images  such  as 
those  taken  by  a  mobile  robot.  Spatial  information  contained 
in  the  images  is  also  used.  The  temporal  sequence,  combined 
with  the  assumption  of  constant  velocities,  provides  powerful  con¬ 
straints  for  the  motion  trajectory  of  rigid  objects. 

We  derive  a  closed  form  solution  for  the  rigid  object  trajectory 
by  integrating  the  differential  equations  describing  the  motion  of 
a  point  on  the  tracked  object .  The  integrated  equations  are  non¬ 
linear  only  in  angular  velocity  and  are  linear  in  all  other  motion 
parameters.  These  equations  allow  the  use  of  a  simple  least- 
square  error  minimization  criterion  during  an  iterative  search  for 
the  motion  parameters.  Experimental  results  demonstrate  the 
power  of  our  method  in  fast  and  reliable  convergence. 

1  INTRODUCTION 

1.1  Background

The  interest  in  temporal  analysis  of  image  sequences  has  sharply 
risen  with  the  increase  of  available  computational  power  [10]. 
Bolles  and  Baker  argue  that  equal  weight  be  given  to  both  tem¬ 
poral  and  spatial  information  contained  in  a  sequence  of  images 
(frames). Two-image motion algorithms often use spatial integration [18]. However, in a two-image analysis the only temporal
information  that  can  be  used  is  differential  (positions  of  image 
points  in  two  close  time  instances).  This  information  is  too  lo¬ 
cal  in  scope  and  too  sensitive  to  errors.  In  this  paper  we  argue 
for  more  extensive  use  of  temporal  information.  Temporal  in¬ 
formation  is  obtained  from  measurements  of  distinctive  feature 
positions  in  the  image  plane  in  several  time  instances  (images). 
A  distinctive  feature  is  labeled  with  both  spatial  and  temporal 
coordinates  and  is  called  an  image  event.  The  time  coordinate 
is  treated  on  an  equal  footing  with  spatial  coordinates.  Thus, 
the idea of smoothing can be applied to the time coordinate as well, resulting in reduced sensitivity to small errors in spatial coordinates. Another important feature of our method is the integration of temporal information, which is achieved by solving differential equations of motion. By integrating temporal information over several images we get more reliable information about the motion trajectory.

There is evidence that temporal integration is performed by humans, as demonstrated in [21]. It is only the integration of temporal information that is discussed here in greater detail.

Computations  of  displacement  fields  [1,6,16,17,22,23,24,38,39] 
by  their  nature  give  more  weight  to  the  spatial  information  in  im¬ 
age  features.  Computations  are  sensitive  to  noise  in  the  image  be¬ 
cause  they  rely  on  derivatives  of  feature  displacements.  The  mo¬ 
tion  parameters  are  obtained  from  a  set  of  high-order  non-linear 
equations,  resulting  in  costly  computations,  low  precision  and 
unstable  solutions.  The  deficiencies  of  these  approaches  are  espe¬ 
cially  visible  in  problems  of  motion  segmentation  and  occlusion. 
Motion algorithms using multiple images from an image sequence have been developed by many authors [2,7,10,14,32,34,42].

The research in this paper has been motivated by the problem formulation of Shariat [34]: finding the motion parameters of a
rigid  object  moving  with  constant  translational  and  rotational  ve¬ 
locities,  under  the  assumption  that  five  images  are  equally  spaced 
in  time.  These  assumptions  enable  Shariat  to  propose  a  set  of 
difference  equations  relating  the  positions  of  features  in  the  im¬ 
age  plane  and  the  motion  parameters  of  the  projected  point.  The 
solution  he  obtains  is  a  set  of  non-linear  polynomial  equations  in 
unknown  motion  parameters.  The  equations  are  of  a  rather  high 
(5th)  order.  The  motion  parameters  are  obtained  using  a  Gauss- 
Newton  non-linear  least-square  method  with  carefully  designed 
initial-guess  schemes. 

Our  work  also  relates  to  that  of  Broida  and  Chellappa  [13,14] 
due  to  the  fact  that  it  considers  estimation  of  motion  parame¬ 
ters  of  a  rigid  body  from  several  images.  Broida  and  Chellappa 
[14]  are  estimating  kinematic  parameters  of  a  rigid  body  and  its 
structure  by  tracking  several  feature  points  in  a  sequence  of  im¬ 
ages.  The  object  of  their  work  is  to  estimate  the  influence  of 
white  noise  in  feature  positions  on  the  recovery  of  motion  pa¬ 
rameters  and  to  find  the  robustness  of  the  method  to  bad  initial 
guesses.  This  is  important  because  the  point  feature  extraction 
methods  return  feature  positions  below  the  expected  precision 
of  the  algorithm.  Their  work  can  be  used  in  the  prediction  of 
the feature positions in the next frame(s). It is based on iterative Kalman filtering techniques [12] and stochastic estimation techniques (good references on the theory behind the stochastic estimation techniques used in their work can be found in [20,27]). Wünsche [41] has a similar approach. The prediction of the feature in the next frame usually leads to a reduced search space in
the  feature  correspondence  problem,  but  we  have  not  considered 
this  important  issue  in  the  present  paper. 


Our work concentrates on the computer vision aspect of the problem, and in a different way proves that using several frames is a good way to detect objects moving in the camera's field of view or detect camera movement. We argue, in the sense of Lowe's [25] uniqueness of the viewpoint constraint, that the image positions of the same feature in several frames (i.e., several time instances) provide a powerful constraint on the possible types of object motion. We might call the time constraint "the uniqueness of the time sequence constraint". This constraint reflects deterministic motion of a rigid body whose initial position and velocities are known. The time constraint combined with Lowe's spatial constraint is an excellent tool for detection of rigid body kinematics and shape from motion.

Our  approach  begins  with  the  set  of  differential  equations 
describing  the  motion  of  a  point  P  on  a  rigid  body  moving  with 
constant  translational  and  rotational  velocity  through  the  envi¬ 
ronment.  The  equations  are  solved  analytically,  i.e.,  a  set  of  para¬ 
metric  equations  P  =  (X(t),Y(t),  Z(t))  describing  the  helix-like 
trajectory  of  the  point  P  is  found.  The  parameters  in  these  equa¬ 
tions  are  the  initial  position  and  the  linear  and  angular  velocity 
of  the  point  P.  Our  goal  is  to  determine  these  parameters  from 
the  central  projection  of  the  trajectory  of  the  point  P  on  the 
image  plane. 

The central projection of the point P on the image plane is labeled Q. This distinctive feature is characterized by image coordinates $x_i$, $y_i$ and the time label $t_i$ of the image: $Q_i = (x_i, y_i, t_i)$, where the index i runs over the n considered images. We will call $Q_i$ image events. Under the assumptions of known projective geometry and constancy of motion parameters, only a few image events $Q_i$ are needed to find the motion parameters of the environmental point P. The reason is that we know (from the closed form solution for the trajectory) what constraints exist between image events $Q_i$ and the motion parameters of the point P. The constraints are non-linear only in rotational parameters and the type of non-linearity is known exactly. In our method, there is no constraint imposed on the time interval between images, and the equations can be more readily adapted for motion of non-rigid objects, as well as for accelerated motions by using perturbation techniques [15,29].

We do not consider the problem of feature correspondence over several images. There are indications that the correspondence problem can be handled quite successfully under the conditions of an approximately known displacement path [3,4,5,8,9]. Bharwani uses correlation measurements and the history of the motion to predict the position of a point feature in the next frame using the concepts developed by Anandan. This approach has some difficulties in trying to predict the feature position with accuracy better than half a pixel, due to the shallow nature of the correlation measure at this resolution. The feature prediction was proven to be very helpful both for efficiency and for more reliable feature matching.

Sethi and Jain [SET87], for example, use "smoothness of motion" to reduce the search explosion of possible correspondences
of  several  image  features  through  several  images.  Once  a  good 
initial  guess  for  motion  parameters  is  found  the  search  space  of 
correspondences  can  be  drastically  reduced  and  possible  corre¬ 
spondences  become  better  defined  as  the  computation  proceeds. 

We plan to attack the feature correspondence problem by using symbolic features such as lines, and perhaps regions. In the case of line matching we can choose feature points to be at the intersection of two lines, and the line correspondence can be established using methods similar to those in Medioni [2*]. In this case the major practical problem is to establish the correct line correspondence. The extra effort is rewarded by greater robustness to noise.

The other possibility is to choose lines themselves as features. Spetsakis and Aloimonos [36] use three frames and 13 line correspondences to establish the translational and rotational parameters of the object whose lines are being tracked. They develop a method resulting in a linear system of equations, which the authors report to be noise sensitive. We plan to develop a theory that will track the motion of lines through several frames. The theory will have a structure similar to the one presented in this paper and it will be a non-linear theory with an explicit frequency dependence. We expect the theory to be less noise sensitive than the one reported by [36].

More recently Williams [40] have developed a spatio-temporal grouping algorithm that produces token-based line correspondences across a sequence of images. This algorithm appears to be robust and utilizes two other algorithms developed at the University of Massachusetts: a spatial line grouping algorithm by Boldt and Weiss [11] and Anandan's algorithm for developing displacement fields with confidence measures [4].

In the case of region matching we can track region centers (like center of mass) as in work by Price [31]. Use of line and region correspondence should result in a more noise-robust approach to the correspondence problem, although we predict some difficulties with the precision of these methods.

One  important  issue  not  considered  here  is  the  choice  of  a 
good  initial  guessing  scheme.  In  this  paper  we  used  simple  pa¬ 
rameter  estimation  techniques  (such  as  the  sign  of  the  path  cur¬ 
vature,  direction  of  motion,  magnitude  of  motion)  to  predict  the 
initial  velocities  and  position  of  the  tracked  feature.  We  hope  to 
incorporate  more  sophisticated  techniques  of  initial  estimation 
similar  to  those  seen  in  work  of  Shariat  [34] .  The  initial  guess¬ 
ing  methods  used  by  Lowe  [25]  in  his  iterative  solution  for  2-D 
to 3-D object recognition are also appropriate for this approach. Lowe's method is also a Newton-Raphson technique (just like ours) and was proven to be very stable and converged to the right solution in almost all cases. We found that this is the case with our method, too. The good convergence is probably due to the viewpoint and time constraints discussed earlier. Therefore, even with initial guesses that are "far" from the correct solution, the methods converge to the correct solution.

We  use  a  set  of  noise-free  synthetic  images  to  demonstrate 
the  feasibility  and  speed  of  the  approach,  and  its  high  accuracy 
in noise-free conditions. At the time of the publication of this
report,  we  have  not  obtained  results  for  motion  sequences  from 
a  natural  environment.  We  expect  that  the  performance  will  not 
be  as  good  as  reported  here,  but  the  use  of  symbolic  features 
should  greatly  improve  the  robustness  of  the  method. 

In  Section  2  we  establish  the  relevant  set  of  equations,  in 
Section 3 we give a method of solution, in Section 4 we present
preliminary  results  and  in  Section  5  we  discuss  further  work. 

2  Establishing  the  Equations 

In  this  section,  we  first  derive  motion  equations  for  general 
object  motion.  We  then  introduce  the  assumptions  of  object 
rigidity and constant linear and angular velocity and solve the
simplified  equations  in  closed  form. 

Our  analysis  of  general  motion  [15]  is  slightly  different  from 
the  analysis  of  a  camera  moving  through  the  environment  [24], 
since  the  possibility  of  tracking  several  objects  is  kept  in  con¬ 
sideration.  Also,  our  method  of  solution  is  different  from  that 
of  Shariat  [34]  and  the  result  is  more  general.  We  integrate  the 
differential  equations  of  motion  and  derive  equations  that  are 
simpler and more powerful since they contain exact and explicit
information  about  the  body  motion  and  they  are  not  restricted 
to  equally-spaced  time  images. 

The camera setup and notation are presented in Figure 1. For the purpose of clarity, we distinguish between 3 coordinate systems: a reference coordinate system with the center at point O (camera-centered coordinate system); an intermediate coordinate system, which is here a center-of-mass coordinate system with the origin C and coordinates $(X_c, Y_c, Z_c)$; and a body-fixed coordinate system $(X_b, Y_b, Z_b)$ with the origin also at the point C. The center-of-mass coordinate system is translating with axes parallel to the axes of the camera-centered coordinate system and the body-fixed coordinate system is rotating around C. The intermediate coordinate system is introduced in order to allow the separation of rotational motion (and perhaps the motion internal to the body-fixed coordinate system) from the rest of the motion. The choice of C is arbitrary for our purposes, and any other point is a valid choice for the center of rotation. Oxy is the image coordinate system.

The position of an arbitrary point P on a moving object is characterized according to Figure 1 by vector R(t) (in matrix notation)

$$R(t) = C(t) + r(t) \qquad (1)$$

where C(t) is the current position of C and r(t) is the current position of the point P relative to C in both the center-of-mass and body-fixed coordinate systems. Differentiation of Eq. (1) gives [15]

$$\left(\frac{dR(t)}{dt}\right)_{camera} = \left(\frac{dC(t)}{dt}\right)_{camera} + \left(\frac{dr(t)}{dt}\right)_{body} + \omega \times r. \qquad (2)$$

$\left(\frac{dC(t)}{dt}\right)_{camera}$ is the contribution to the speed of point P coming from the motion of the center-of-mass coordinate system and the term $\omega \times r$ defines changes in the position of the point P due to the instantaneous rotation $\omega$ of the body around C. The term $\left(\frac{dr(t)}{dt}\right)_{body}$ is the internal velocity of the point P in the body-fixed coordinate system. This velocity is zero if the body is rigid.

We now introduce the assumptions that the object is rigid, and has constant translational and rotational velocity. The equation of motion of the point P in the camera-centered coordinate system is then the solution of the following pair of equations, describing a constant translational motion V of the origin C and a constant rotational motion $\omega$ of the point P around C.

$$\left(\frac{dC(t)}{dt}\right)_{camera} = V \qquad (3a)$$

$$\left(\frac{dr(t)}{dt}\right)_{camera} = \omega \times r \qquad (3b)$$

If the translational velocity V is not changing in time then the solution of Eq. (3a) is

$$C(t) = C_0 + V \cdot t. \qquad (4)$$

$C_0$ is the initial position of the body-fixed coordinate system.

Figure 1: Motion of arbitrary rigid body, and coordinate systems.

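Let us now derive the solution for the rotational motion, Eq. (3b). For constant angular velocity the solution of Eq. (3b) can be found in several ways.³ We rewrite Eq. (3b) in a slightly different form

$$\frac{dr(t)}{dt} = \Omega\, r(t) \qquad (5a)$$

where

$$\Omega = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix} \qquad (5b)$$

is the matrix specifying the rotation of the body. The solution of Eq. (5a) is

$$r(t) = e^{\Omega t}\, r_0. \qquad (6)$$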

$r_0$ is the initial position of the point P in the body-fixed coordinate system. Here, we make an important observation that the matrix $\Omega$ satisfies the relation

$$\Omega^3 = -\|\omega\|^2\, \Omega \qquad (7a)$$

where $\|\omega\|$ is the norm of the angular velocity $\omega = (\omega_x, \omega_y, \omega_z)$

$$\|\omega\|^2 = \omega_x^2 + \omega_y^2 + \omega_z^2. \qquad (7b)$$

This observation enables us to simplify the matrix equation (6). We find, using Eqs. (7), that the solution of Eq. (5) is

$$r(t) = \left[ I + S(\omega, t)\, \Omega + C(\omega, t)\, \Omega^2 \right] r_0 \qquad (8a)$$

where I is the identity matrix and we introduce shorthand notation

$$S(\omega, t) = \frac{\sin(\|\omega\|\, t)}{\|\omega\|} \qquad (8b)$$

$$C(\omega, t) = \frac{1 - \cos(\|\omega\|\, t)}{\|\omega\|^2} \qquad (8c)$$

Eqs. (8) are similar to Rodrigues formula [19]. The use of functions $S(\omega, t)$ and $C(\omega, t)$ is particularly convenient for small rotations, when these functions are approximately independent of $\|\omega\|$.

The motion of a point P of a rigid body moving with constant translational velocity V and constant rotational velocity $\omega$, with the point's starting position

$$R_0 = C_0 + r_0 \qquad (9)$$

is then derived by combining the solutions for translational motion (Eq. (4)) and rotational motion (Eq. (8a)):

$$R(t) = R_0 + V \cdot t + \left[ S(\omega, t) \cdot \Omega + C(\omega, t) \cdot \Omega^2 \right] r_0. \qquad (10)$$

$\Omega$ is given by Eq. (5b) and $S(\omega, t)$ and $C(\omega, t)$ are given by Eqs. (8b) and (8c).

Without loss of generality we can assume that $C_0 = 0$, i.e., that the center of rotation and the camera coordinate system coincide at time t = 0.⁴ With the assumption $r_0 = R_0$ the equation Eq. (10) becomes:

$$R(t) = R_0 + V \cdot t + \left[ S(\omega, t) \cdot \Omega + C(\omega, t) \cdot \Omega^2 \right] R_0. \qquad (11)$$

This equation describes the helix-like trajectory of the point P moving with constant translational and rotational velocity.

There are 9 parameters in Eq. (11), three for each of: the initial position of the point $R_0$, the translational velocity V, and the rotational velocity $\omega$. The non-linearity in $\omega$ is evident from the rotational (third) summand.
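The closed-form trajectory of Eq. (11) is straightforward to evaluate. The sketch below is an illustration only, not the authors' Common Lisp implementation; the helper names `omega_times` and `trajectory` are hypothetical, and the product of $\Omega$ with a vector is computed as the cross product with $\omega$, assuming $\Omega$ is the usual skew-symmetric matrix of $\omega$. For very small $\|\omega\|$ the code switches to the limits $S \to t$ and $C \to t^2/2$, which is also an assumption added here for numerical safety.

```python
import math


def omega_times(w, a):
    """Product Omega a, computed as the cross product w x a."""
    return [w[1] * a[2] - w[2] * a[1],
            w[2] * a[0] - w[0] * a[2],
            w[0] * a[1] - w[1] * a[0]]


def trajectory(r0, v, w, t):
    """Position R(t) = R0 + V t + [S Omega + C Omega^2] R0 of Eq. (11)."""
    norm = math.sqrt(sum(c * c for c in w))
    if norm < 1e-12:
        # Limits of S and C as the angular velocity goes to zero
        s, c = t, 0.5 * t * t
    else:
        s = math.sin(norm * t) / norm
        c = (1.0 - math.cos(norm * t)) / norm ** 2
    o_r = omega_times(w, r0)      # Omega R0
    oo_r = omega_times(w, o_r)    # Omega^2 R0
    return [r0[i] + v[i] * t + s * o_r[i] + c * oo_r[i] for i in range(3)]
```

As a sanity check, a point at (1, 0, 0) rotating about the Z axis with angular speed π/2 and no translation should reach (0, 1, 0) after one unit of time, and with zero angular velocity the trajectory reduces to a straight line.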

Let Q = (x(t), y(t)) be the central projection of the point P on the image plane. The components of R(t) = (X(t), Y(t), Z(t)) satisfy the following set of equations

$$f X(t) - x(t) Z(t) = 0 \qquad (12a)$$

$$f Y(t) - y(t) Z(t) = 0 \qquad (12b)$$

where f is the focal length of the camera, assumed to be the unit of length (f = 1). We assume that Z(t) ≠ 0 for all t. Since there are 9 unknown parameters we need at least 5 images to determine the motion parameters of a single point P. (Each image supplies two equations for the unknown parameters.) Input parameters are image events $Q_i = (x_i, y_i, t_i) = (x(t_i), y(t_i), t_i)$ for some arbitrary times $t_i$, i = 0, 1, ..., 4. Some other combination of the number of images and the number of points belonging to the same rigid body can be used as long as there are enough equations to solve for the unknown parameters. However, a larger number of images gives a more reliable prediction of motion, and is thus preferred to a large number of points, unless there is a danger of feature disappearance during the time interval considered.
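The projection constraints of Eqs. (12) and the counting argument can be made concrete with a few lines of code. This is only a sketch with hypothetical function names (`project`, `constraint_residuals`, `min_images`); it assumes a single tracked point, so that each image contributes exactly two equations.

```python
import math


def project(point, f=1.0):
    """Central projection (x, y) = (f X / Z, f Y / Z) of a 3-D point."""
    X, Y, Z = point
    if Z == 0:
        raise ValueError("projection is undefined for Z = 0")
    return (f * X / Z, f * Y / Z)


def constraint_residuals(point, x, y, f=1.0):
    """Left-hand sides of Eqs. (12a) and (12b); both vanish for a consistent event."""
    X, Y, Z = point
    return (f * X - x * Z, f * Y - y * Z)


def min_images(n_unknowns):
    """Smallest number of images of one point giving at least n_unknowns equations."""
    return math.ceil(n_unknowns / 2)
```

With 9 unknowns `min_images` gives 5, in agreement with the text. Note that the residuals are linear in the point coordinates, whereas the projection itself involves a division by Z.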

3  Method  of  Solution 

In  this  section  we  use  a  generalized  version  of  Newton’s 
iterative  method  to  solve  the  equations  for  the  motion  parameters 
which we developed in the previous section. We can foresee some
of  the  advantages  of  the  proposed  solution,  Eq.  (11).  The  known 
type  of  non-linearity  makes  the  method  converge  fast  and  be  more 
stable.  It  is  quite  possible  that  there  is  an  especially  suitable 
numerical  procedure  for  this  type  of  non-linearity,  although  we 
have  not  found  one  yet. 

The initial position of the point P is determined by the 3 parameters $X_0$, $Y_0$, and $Z_0$. Since Eqs. (12) are homogeneous equations of the first order in X(t), Y(t), and Z(t), the solution is determined up to a scale factor, usually referred to as the loss of depth during the central projection. We can set the scale factor equal to an arbitrary constant, $Z_0$, supplied in practice by some other method (e.g., from laser range data). The first image in the sequence then provides enough information to determine


4The  search  f<-r  parameter?  i.1 
the two unknown parameters

$$X_0 = x_0 Z_0, \qquad (13a)$$

$$Y_0 = y_0 Z_0. \qquad (13b)$$

Thus, we are left with six unknown parameters labeled as a group

$$\xi = (V_x, V_y, V_z, \omega_x, \omega_y, \omega_z) \qquad (14)$$

for which we need at least three more images. In the work presented here we were not concerned with the determination of the structure of the moving body. One method for determining the structure can be found in the work by Broida and Chellappa [14]. Their method results in eight unknown parameters and computes the structure of the moving body.

The closed form of the solution enables us to simplify the formulation of the minimization problem in that the error becomes a linear function of point coordinates. Namely, assuming no over-determined system of equations, we can use a generalized Newton iterative procedure [37] in the form

$$J(\xi_n) \cdot (\xi_{n+1} - \xi_n) = -e(\xi_n), \quad n = 0, 1, \ldots \qquad (15)$$

where the error of the solution is

$$e(\xi) = \begin{bmatrix} X(t_1) - x_1 Z(t_1) \\ X(t_2) - x_2 Z(t_2) \\ X(t_3) - x_3 Z(t_3) \\ Y(t_1) - y_1 Z(t_1) \\ Y(t_2) - y_2 Z(t_2) \\ Y(t_3) - y_3 Z(t_3) \end{bmatrix}. \qquad (16)$$

$\xi_{n+1}$ is a new, better set of values for the unknown motion parameters. $J(\xi_n)$ is the Jacobian matrix

$$J_{ij}(\xi) = \frac{\partial e_i}{\partial \xi_j}, \quad i, j = 1, \ldots, 6 \qquad (17)$$

which is easily computable from Eqs. (16). Note that since we already know the explicit dependence of the coordinates of P on the motion parameters, we do not have to minimize expressions of the form

$$x_i - \frac{X(t_i)}{Z(t_i)} \qquad (18)$$

often found in optic flow (differential) computations. By formulating the minimization problem in this way, Eqs. (16), we avoid additional non-linearity of equations and are headed for faster and more stable convergence.
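The iteration of Eq. (15) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the Jacobian is approximated by forward differences instead of being derived analytically, the linear system is solved by plain Gaussian elimination with partial pivoting, and all function names (`trajectory`, `error`, `jacobian`, `solve`, `newton`) are hypothetical. The initial position `r0` is taken as known, as in Eqs. (13), and `events` holds three image events (x, y, t).

```python
import math


def trajectory(r0, xi, t):
    """Eq. (11) for xi = (Vx, Vy, Vz, wx, wy, wz); Omega a is taken as w x a."""
    v, w = xi[:3], xi[3:]
    norm = math.sqrt(sum(c * c for c in w))
    if norm < 1e-12:                       # small-rotation limits of S and C
        s, c = t, 0.5 * t * t
    else:
        s = math.sin(norm * t) / norm
        c = (1.0 - math.cos(norm * t)) / norm ** 2

    def rot(a):
        return [w[1] * a[2] - w[2] * a[1],
                w[2] * a[0] - w[0] * a[2],
                w[0] * a[1] - w[1] * a[0]]

    o_r = rot(r0)
    oo_r = rot(o_r)
    return [r0[i] + v[i] * t + s * o_r[i] + c * oo_r[i] for i in range(3)]


def error(xi, r0, events):
    """Eq. (16): all X(t_i) - x_i Z(t_i), followed by all Y(t_i) - y_i Z(t_i)."""
    pts = [trajectory(r0, xi, t) for (_, _, t) in events]
    ex = [p[0] - x * p[2] for p, (x, _, _) in zip(pts, events)]
    ey = [p[1] - y * p[2] for p, (_, y, _) in zip(pts, events)]
    return ex + ey


def jacobian(xi, r0, events, h=1e-6):
    """Forward-difference approximation of J_ij = d e_i / d xi_j (an assumption)."""
    e0 = error(xi, r0, events)
    cols = []
    for j in range(len(xi)):
        shifted = list(xi)
        shifted[j] += h
        ej = error(shifted, r0, events)
        cols.append([(a - b) / h for a, b in zip(ej, e0)])
    return [[cols[j][i] for j in range(len(xi))] for i in range(len(e0))]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            fac = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= fac * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def newton(xi0, r0, events, steps=10):
    """Iterate J(xi_n) (xi_{n+1} - xi_n) = -e(xi_n), Eq. (15)."""
    xi = list(xi0)
    for _ in range(steps):
        e = error(xi, r0, events)
        delta = solve(jacobian(xi, r0, events), [-val for val in e])
        xi = [p + d for p, d in zip(xi, delta)]
    return xi
```

Whether the iteration converges from a given initial guess depends on the data and on the guess; the sketch makes no attempt at step-size control or at detecting a singular Jacobian.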

4  Experimental  Results 

In this section we give preliminary results for experimental
runs  designed  to  test  the  ideas  and  feasibility  of  our  approach. 
The  results  demonstrate  the  generality  of  the  method  and  its 
very  fast  convergence. 

The algorithm is implemented in Common Lisp on a TI-Explorer. The algorithm takes only a few seconds to compute motion parameters. The sequence of images in Figure 2 shows centrally-projected trajectories after each step in Newton's iterative procedure for a point moving with constant linear and angular velocity. In each of the images in Figure 2 there are three different types of trajectories. One, labeled with squares, represents the correct (goal) trajectory. Each square is centered around the image coordinates $(x_i, y_i)$ of the point moving with the correct set of motion parameters $\xi = (V; \omega)$. Image events $Q_i = (x_i, y_i, t_i)$, i = 0, 1, 2, 3, are provided as input parameters to the iterative algorithm. An initial guess for motion parameters $\xi_0$ produces the trajectory labeled with crosses. The trajectory which represents the motion with improved motion parameters $\xi_s$, s = 1, 2, 3, ... (after each step s of iteration) is labeled by triangles. Note that the initial position of the point $R_0$ is found, according to Eqs. (13), from the first pair of image coordinates $(x_0, y_0)$. This is the reason that all the trajectories start from the same initial point (labeled as "0").

Four iterative steps in the recovery of motion parameters for the trajectory labeled by squares. Crosses label the initially guessed trajectory (0), triangles label trajectories after the step s (trajectories s = 1, 2, 3, 4) of the Newton iteration method described in text. Squares label the correct solution.

Figure 2: Convergence of Iterative Method in Recovery of Motion Parameters.

To demonstrate the generality of the method, we choose to
detect motion with an arbitrary linear and angular velocity. Figure
2 illustrates experiments which detect the motion of a point on a
body moving with linear velocity $V = (0.1, 0.2, 0.3)$ and angular
velocity $\omega = (0.3, 0.2, 0.2)$, i.e.,

$$\xi_\infty = (0.1, 0.2, 0.3;\ 0.3, 0.2, 0.2).$$

The unit of length is $f = 1$ and the unit of time is 1.

Convergence to the correct solution is highly dependent on
the choice of the initial guess. Even a very rough and easily
obtainable estimate of the motion parameters (e.g., the sign of
the z-component of the linear and angular velocity) drastically
reduces the amount of search. As can be seen in Figure 2, our
initial guess $\xi_0 = (0.0, 0.3, 0.4;\ 0.2, 0.0, 0.2)$ is "far" from the
correct solution $\xi_\infty$, as measured by the relative error
$\|\xi_0 - \xi_\infty\| / \|\xi_\infty\|$, but has the same sign of the curvature and
direction of motion in one direction. After the first iteration we get

step 1: $\xi_1 = (0.21, 0.42, 0.48;\ 0.50, 0.36, 0.25)$. (Figure 2a)

This is shown as the trajectory labeled with triangles in Figure 2a.
In each of the subsequent figures we show the current trajectory
(triangles) after the second, third, and fourth iteration, together
with the initially guessed and the correct trajectory:

step 2: $\xi_2 = (0.07, 0.25, 0.47;\ 0.33, 0.25, 0.25)$, (Figure 2b)

step 3: $\xi_3 = (0.09, 0.21, 0.30;\ 0.31, 0.20, 0.21)$, (Figure 2c)

step 4: $\xi_4 = (0.099, 0.201, 0.300;\ 0.300, 0.200, 0.201)$. (Figure 2d)

Thus, after only 4 iterations the relative error in the motion
parameters is

$$\|\xi_4 - \xi_\infty\| / \|\xi_\infty\| \approx 0.3\%.$$

Note, in particular, that components $V_x$ and $\omega_y$ were first
initialized to 0 but are still correctly computed. Good convergence was
also found for many other types of motions and initial guesses.
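
The relative error quoted above can be checked directly from the printed parameter vectors. The following Python is a small sketch of that arithmetic; the function names are hypothetical, the error is assumed to be a Euclidean norm over all six parameters, and the values are the iterates as they are read from the text.

```python
import math

# Parameter vectors (V; omega) as reported in the text.
CORRECT = (0.1, 0.2, 0.3, 0.3, 0.2, 0.2)
ITERATES = [
    (0.0, 0.3, 0.4, 0.2, 0.0, 0.2),            # initial guess
    (0.21, 0.42, 0.48, 0.50, 0.36, 0.25),      # step 1
    (0.07, 0.25, 0.47, 0.33, 0.25, 0.25),      # step 2
    (0.09, 0.21, 0.30, 0.31, 0.20, 0.21),      # step 3
    (0.099, 0.201, 0.300, 0.300, 0.200, 0.201),  # step 4
]


def norm(v):
    """Euclidean norm of a parameter vector."""
    return math.sqrt(sum(x * x for x in v))


def relative_error(xi, ref=CORRECT):
    """Distance from the reference solution, relative to its norm."""
    return norm([a - b for a, b in zip(xi, ref)]) / norm(ref)


errors = [relative_error(xi) for xi in ITERATES]
for step, err in enumerate(errors):
    print(f"step {step}: relative error = {err:.2%}")
```

Under this reading of the numbers, the error after step 4 comes out at about 0.3%, in agreement with the text. The first step moves farther from the solution than the initial guess before the later steps close in.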

5  Discussion 

As expected for any set of non-linear equations, there are initial
guesses for which the convergence is poor and the expected
solution is not found. The situation can be compared with finding
the roots of the equation $z^3 - 1 = 0$ in the complex plane.
Depending on the initial guess any of the three roots can be found.
Although the solution is usually the root which is the closest to
the initial guess, there are some cases where a different root is
found. The problem can sometimes be resolved using over-constrained
systems (more image events than necessary). We are currently
working on exploring the relation between the type of motion
and the value of the initial guess, in order to devise appropriate
initial guessing schemes.
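
The comparison with root finding in the complex plane is easy to reproduce. The Python sketch below assumes the equation is the cubic with three roots of unity, $z^3 - 1 = 0$; the function names and the starting points are chosen here purely for illustration.

```python
import cmath

# The three cube roots of unity: 1, exp(2*pi*i/3), exp(4*pi*i/3).
ROOTS = [cmath.exp(2j * cmath.pi * k / 3) for k in range(3)]


def newton_cubic(z, steps=50):
    """Newton's method for f(z) = z**3 - 1, with f'(z) = 3 z**2."""
    for _ in range(steps):
        z = z - (z * z * z - 1) / (3 * z * z)
    return z


def nearest_root(z):
    """Index of the cube root of unity closest to z."""
    return min(range(3), key=lambda k: abs(z - ROOTS[k]))


for z0 in (1.2 + 0.1j, -1 + 1j, -1 - 1j, -0.5 + 0j):
    print(z0, "starts nearest root", nearest_root(z0),
          "and converges to root", nearest_root(newton_cubic(z0)))
```

The first three starting points converge to the root they are closest to. The last one, on the negative real axis, is closer to the two complex roots but stays on the real axis and ends at the root 1, an instance of the behaviour described above.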

In  a  forthcoming  paper  we  will  present  a  detailed  analysis  of 
the  algorithm  for  both  synthetic  and  real  images,  and  for  sev¬ 
eral  types  of  motion.  We  are  presently  exploring  a  number  of 
important  issues,  before  applying  the  method  to  a  sequence  of 
real  images.  One  problem  is  the  feature  correspondence  over 
several  images.  A  larger  distance  between  image  features  will 
better  define  the  trajectory  and  recover  motion  parameters  more 
accurately.  Unfortunately,  distant  features  are  more  difficult  to 
correlate  between  images.  A  strength  of  our  method  is  that  it 
allows  arbitrary  time  intervals  between  successive  images.  Allow¬ 
ing  larger  time  intervals  between  images  facilitates  the  separation 
of  different  motions  with  similar  projections. 

Another issue is the sensitivity of the solution to the positional
error of the image features. The idea of smoothing [30]
applied to the time coordinate, together with error analysis similar
to [14,26,35], could be used as a promising start. The Newton
iterative method is easily generalized to over-determined systems,
so as to ensure greater robustness to noise. A related problem is
the non-uniqueness of solutions of the non-linear equations. This
is the case, for example, when the trajectory has a much larger
$\|\omega\|$ than expected, but passes through the same space-time
events, a phenomenon that could be called the "stroboscopic" effect.
In a sense it is an undersampling in the time domain that
could be corrected. Over-determined systems have a potential to
solve this problem as well. If we can assume that angular velocity
is small, then the solution space of the iteration procedure can
be restricted and multiple solutions can be avoided.

If  the  rotational  frequencies  are  very  high,  a  completely  dif¬ 
ferent  problem  arises:  it  becomes  increasingly  hard  to  track  the 
motion  of  a  single  feature  due  to  its  periodic  disappearance.  Our 
approach  is  suitable  for  high  frequencies  since  it  does  not  require 
a constant time interval between the time frames. However,
some  estimation  techniques  of  the  feature  positions  will  have  to 
be  incorporated. 

We  also  plan  to  investigate  the  motion  of  several  objects  si¬ 
multaneously,  as  well  as  that  of  several  features  on  a  single  object. 
Tracking  of  several  features  on  a  single  object  can  produce  an  es¬ 
timate  of  the  3-D  structure  of  the  object  or  it  can  be  used  to 
facilitate  motion  segmentation.  Another  very  interesting  direc¬ 
tion  for  investigation  is  the  derivation  of  equations  (in  a  manner 
similar  to  the  one  presented  in  this  paper)  for  structures  more 
complex than a point, i.e., lines, planes, and other surfaces of
a  rigid  body  [25,36,40].  Finally,  using  perturbation  techniques 
these  equations  can  be  generalized  to  motions  of  bodies  with 
slowly  changing  shape,  and/or  slowly  changing  motion  parame¬ 
ters. 

5.1  Conclusion 

In  conclusion  we  have  used  a  spatio-temporal  analysis  of  image 
events  to  present  a  quite  general,  robust  and  computationally 
efficient method for the recovery of motion parameters of moving
objects under the formulation posed by Shariat [34] of constant
motion. The closed form solution (containing exact and
explicit information about the body motion) has other advantages:
solution strategies can be adapted to the known type of
non-linearity resulting in faster and more reliable convergence;
the initial guessing scheme can exploit constraints derived from
the  proposed  solution;  a  generalization  to  more  images,  more  fea¬ 
tures,  and/or  variable  time  intervals  between  images  is  readily 
available;  and,  finally,  the  constraint  of  constant  motion  param¬ 
eters  can  be  relaxed,  using  perturbation  techniques  for  slowly 
varying  parameters. 
Abstract

We address the problem of recognition of animal motion
from pure motion cues. A graphical representation
for animal gait is introduced, and its connectionist
implementation given. Assuming principal views,
a  method  of  indexing  into  this  gait  representation 
from  low  level  invariants  is  described.  Preliminary 
results  from  the  first  implementation  are  presented. 

The  approach  is  motivated  and  constrained  by  bio¬ 
logical  considerations. 

I.  Introduction 

The human visual system has a remarkable ability to discriminate
between different types of movement. The classic
illustration of this ability is Johansson's Moving Light
Display (MLD) [Johansson, 1973, Johansson, 1976]. Reflective
pads were placed at the joints of an actor dressed in
black, and the actor illuminated. Films were taken of the
actor walking, jumping and making various other movements
against a black backdrop. When these films were
shown to subjects, they all recognized the display to be of
a person walking, jumping, etc., but reported single frames
to be meaningless patterns of dots. The time required for
this act of recognition was on the order of a perceptual
event: a presentation time of no more than 200 msec was
sufficient for all subjects to make the correct discrimination.
Further experiments [Kozlowski and Cutting, 1977,
Cutting and Kozlowski, 1977] demonstrated the sensitivity
of this faculty: subjects could determine the actor's gender,
and could even identify the actor if (s)he was known
to the subject.

This paper reports the early results of an attempt to
produce a computational account of this capability, consistent
with the psychological and neurophysiological literature.
A crucial aspect of the model is the representation
and processing of temporal factors. We assume an
input representation that could reasonably be expected to
be available from lower levels in the visual pathway. The
representation of memory and the mechanism for indexing
into it is expressed as a connectionist network. The
output is a pattern of activation over a small part of the
memory network, indicating which object/movement has
been recognized.

II.  Motion  Processing  in  Primate 
Visual  Systems 

Visual motion is in essence unlike other visual information.
Color, stereopsis, texture and brightness are all static qualities,
which we can experience immediately. Visual motion
is a temporal quality. Although our experience of it seems
immediate, it has sometimes been treated as a perception
constructed from memory of a sequence of images. By now
it is clear that motion is in fact a fundamental biological
sense (for overviews, see [Nakayama, 1985] and [Maunsell
and Newsome, 1987]).

There are two obvious ways the motion information
available in the Johansson experiments could be used to
generate the percepts of person and walking, or indeed the
single percept of walking person. The first method would
be to use the motion information to index directly into
memory, implying a memory representation rich in temporal
information. This method places motion information
in a central position vis-a-vis the recognition process. The
second method would use the motion information to reconstruct
various static qualities of the scene object (such
as structure), and use those static qualities to index into
memory and recognize the object. Having recognized the
object, the motion of various key parts of the object could
be used to discriminate between gaits. In this second
method, the motion information is used in two ways: to recover
static qualities; and to disambiguate a small number
of gaits.

In this paper we address the first method, but do not
rule out the second. The underlying view is that the visual
system can use multiple sources of information which
are processed in parallel by separate but interacting systems.
No one source or system is required for recognition:
each one alone may be sufficient. Each recognition process
indexes into a memory structure tailored to the type
of information it uses, the different memory structures for
the same object being linked. Although pure motion information
may also be used to derive structural information
which in turn may drive the form recognition process, we
argue here for a recognition process operating directly on
the motion information and indexing into a memory structure
devoted to the representation of motion.

Such  a  motion-specific  process  and  memory  structure 
must  play  a  role  in  MLD  experiments.  MLD’s  presented 
upside  down  are  not  recognized,  whereas  moving  stick 
figures  presented  upside  down  are  recognized  [Feldman, 
1988a].  If  the  hypothesized  structure-from-motion  process 
is  independent  of  memory,  it  should  work  equally  well  with 
upside  down  MLD’s,  deriving  upside  down  stick  figures 
that  are  recognizable.  Since  these  upside  down  MLD’s  are 
not  recognizable,  either  there  is  a  separate  motion  recog¬ 
nition  process,  or  the  structure-from-motion  process  is  de¬ 
pendent  on  top-down  feedback,  which  would  imply  a  mem¬ 
ory  structure  rich  in  motion  information.  Brain  damaged 
patients  provide  further  evidence  for  separate  processes. 
Lesions  to  the  temporal  lobe  can  lead  to  the  inability  to 
identify  faces,  while  leaving  intact  the  ability  to  identify 
from  body  motion  [Damasio  et  al.,  1982];  and  lesions  to 
parietal  cortex  can  impair  recognition  from  body  motion 
while  leaving  object  recognition  unimpaired. 

III.  Computational  Issues 

As  with  any  biological  recognition  problem,  the  fundamen¬ 
tal  computational  issues  are:  what  are  the  tractably  com¬ 
putable  invariants  in  the  input?;  what  is  the  memory  rep¬ 
resentation  for  ecologically  significant  items  in  terms  of 
these  invariants?;  and  how  are  the  input  invariants  used  to 
index  into  the  memory  representation? 

We  have  already  established  that  the  representation 
of  animal  gait  in  memory  will  be  complete  in  itself;  it  will 
not  presume  any  particular  representation  of  the  static  ob¬ 
ject.  It  is  not  possible  however  to  describe  the  movement 
of  a  non-rigid  object  without  at  least  implicitly  describing 
the  structure  of  the  object.  We  will  refer  to  a  sequence 
of  movements  of  co-ordinated  parts  of  an  animal  as  a  sce¬ 
nario,  using  Feldman’s  term  in  a  restricted  fashion  [Feld¬ 
man,  1988b]. 

Animal  motion  is  not  arbitrary.  Across  a  wide  range 
of  animals  it  can  be  characterized  by  pendular  motion: 
the  upper  arm  rotates  about  the  shoulder,  the  lower  arm 
about  the  elbow,  the  hand  about  the  wrist,  etc.  We  may 
abstract the skeletally based motion as that of a jointed
stick figure. Moreover animal motion is generally planar.
When  walking  or  running,  most  limb  movements  are  in 
the  plane  whose  perpendicular  is  defined  by  the  direction 
of  gravity  and  the  direction  of  overall  motion.  Dancing 
is  an  example  of  a  human  movement  that  is  truly  three 
dimensional. 

A.  Restricting  the  Problem  Space 

We  make  the  following  assumptions.  The  motion  to  be 
recognized  is  that  of  an  articulated  stick  figure  with  bright 
spots  at  the  joints,  moving  parallel  to  the  image  plane  and 
viewed orthogonally. If the trunk of the figure is moving,
we assume the imaging system is tracking the center of
rotation of the trunk, so that in the image the trunk is
undergoing pure rotation. The limbs are rotating about
an  end  of  the  trunk,  and  so  on.  Thus  we  have  a  movement 
which  can  be  completely  described  by  the  length  of  each 
stick  in  the  figure,  and  the  change  in  angle  at  each  joint 
over  time.  We  shall  assume  that  the  change  in  joint  angle 
over  time  is  piecewise  linear,  i.e.  a  sequence  of  segments 
of  constant  angular  velocity. 

We  have  now  delimited  the  class  of  movements  in  such 
a  way  that  a  complete  representation  is  possible.  Starting 
with the trunk, we assign a level to each part (stick) of the
figure.  The  level  is  given  by  the  number  of  joints  between 
the  part  and  the  trunk.  Thus  a  thigh  would  be  at  level 
one  and  a  calf  at  level  two.  Then  the  motion  of  the  figure 
is  described  by  the  motion  of  each  joint  between  a  part  at 
a  lower  level  and  a  part  at  the  next  highest  level.  Each 
of  these  joints  undergoes  a  sequence  of  constant  angular 
velocity  changes,  which  for  biological  movements  such  as 
walking  is  periodic.  The  fundamental  motion  event  under 
these assumptions is a change in angular velocity. The set
of sequences together with information co-ordinating them
describes the motion completely: sufficiently to unambiguously
regenerate it. Such a set of sequences of events
constitutes a scenario.
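
The claim that event sequences suffice to regenerate the motion can be illustrated for a single joint. The Python below is a minimal sketch under the stated assumptions (piecewise constant angular velocity, periodic motion); the class and field names, the units, and the example joint are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: int        # time step at which the angular velocity changes
    velocity: float  # new angular velocity, in degrees per time step


@dataclass(frozen=True)
class JointSequence:
    level: int          # number of joints between this part and the trunk
    start_angle: float  # joint angle at time 0, in degrees
    period: int         # length of one cycle, in time steps
    events: tuple       # Events sorted by time, the first one at time 0

    def angle_at(self, t):
        """Regenerate the joint angle at time t by summing velocity segments."""
        t = t % self.period
        angle = self.start_angle
        for i, ev in enumerate(self.events):
            if t <= ev.time:
                break
            last = i + 1 == len(self.events)
            end = self.period if last else self.events[i + 1].time
            angle += ev.velocity * (min(t, end) - ev.time)
        return angle


# Hypothetical level-one joint: swing forward, pause, swing back, pause.
thigh = JointSequence(level=1, start_angle=0.0, period=10, events=(
    Event(0, 15.0), Event(3, 0.0), Event(5, -15.0), Event(8, 0.0)))

print([thigh.angle_at(t) for t in range(11)])
```

A full scenario would hold one such sequence per joint, together with the coordination information between them.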

A  sample  abstract  scenario  is  given  in  Figure  1.  In 
this  diagram  the  column  on  the  right  (b)  shows  an  ab¬ 
stract  stick  figure  initially  in  the  shape  of  the  numeral  four, 
undergoing  periodic  motion.  Over  the  course  of  ten  time 
steps,  as  noted  by  the  column  on  the  left,  it  articulates  a 
cyclical  motion,  the  horizontal  stick  (the  “trunk”)  remain¬ 
ing  stationary  and  the  other  three  sticks  rotating  about  its 
ends.  The  middle  three  columns  each  show  one  of  these 
three  sticks  rotating  about  the  horizontal  stick.  The  solid 
line  drawings  indicate  an  event  occurring,  the  dotted  line 
drawings  are  merely  for  illustration.  In  this  scenario,  there 
are  three  parts  each  of  which  undergoes  four  events  in  the 
ten  step  period.  At  time  step  10,  the  object  is  back  in  its 
original  shape,  ready  to  repeat  the  cycle  once  again. 


Figure  2:  scenario  graph 


Figure  1:  folding-four  scenario 


B.  Low-level  visual  invariants 

What kind of input data can we expect to be available for
indexing into the gait memory? The stimulus is a moving
pattern of small bright spots against a dark background.
We assume a mid-level process computing the Visual Vector
Analysis [Johansson, 1973], giving as output an encoding
of a moving stick figure. The vector analysis process
continuously computes, for each joint, the angle and angular
velocity at that joint. For example the angle and
angular velocity of the lower leg relative to the upper leg
would be continuously available. As in the related Tinker
Toy recognition project [Cooper and Hollbach, 1987], we
assume a principal views treatment and features which are
invariant to scale. As viewpoint changes, so that motion
is no longer parallel to the image plane, these angular position
and velocity cues vary little and can be considered
invariant for the purposes of indexing.

Why should we assume these particular cues for indexing?
We have ample evidence of cells tuned for orientation
[Allman et al., 1985, Rodman and Albright, 1988]. Combining
the output of two or more of these would make the
angle at the joint available. There are cells sensitive to rotation
[Saito et al., 1986, Sakata et al., 1985], although not
sufficiently highly tuned for particular velocities. However
the output of several broadly tuned cells can be combined
to achieve finer tuning. The strongest evidence for motion
cues based on velocity is that people can estimate position
and velocity well for a variety of tasks, but further
derivatives (acceleration, etc.) poorly.

C.  Representing  Scenarios 

A  simple  graphical  representation  follows  from  the  specifi¬ 
cation  of  a  scenario  given  above.  We  represent  each  event, 
or  point  (in  time)  when  the  angular  velocity  changes,  by 
a graph node. The nodes are labeled with the new angular
velocity and the absolute angle of the joint at that
time. Directed edges between nodes represent sequence,
each  edge  being  labeled  with  the  time  between  the  two 
nodes.  Each  sequence  is  represented  by  such  a  graph.  The 
graph  is  cyclical  if  the  sequence  is  periodic.  The  graphs  for 
the  sequences  are  linked  with  directed  edges  that  specify 
the  co-ordination  between  the  sequences,  using  labels  on 
the  edges  as  before. 
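
One possible encoding of such a scenario graph is sketched below in Python. The class names, the tuple layout of the edges, and the two example joints are assumptions made for illustration; in particular, coordination edges are added here only between simultaneous events, which is one simple reading of the description above.

```python
from dataclasses import dataclass, field


@dataclass
class EventNode:
    joint: int       # which joint sequence the event belongs to
    time: int        # time step of the event, used to derive edge labels
    velocity: float  # new angular velocity at the event
    angle: float     # absolute joint angle at the event


@dataclass
class ScenarioGraph:
    period: int
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (src, dst, delay, kind)

    def add_sequence(self, joint, events):
        """Add a cyclic chain of nodes from (time, velocity, angle) triples."""
        first = len(self.nodes)
        for time, velocity, angle in events:
            self.nodes.append(EventNode(joint, time, velocity, angle))
        count = len(events)
        for k in range(count):
            src, dst = first + k, first + (k + 1) % count
            delay = (self.nodes[dst].time - self.nodes[src].time) % self.period
            # A one-event sequence loops back to itself after a full period.
            self.edges.append((src, dst, delay or self.period, "sequence"))

    def link_simultaneous(self):
        """Add zero-delay edges in both directions between simultaneous events."""
        for a, na in enumerate(self.nodes):
            for b, nb in enumerate(self.nodes):
                if a < b and na.joint != nb.joint and na.time == nb.time:
                    self.edges.append((a, b, 0, "coordination"))
                    self.edges.append((b, a, 0, "coordination"))


graph = ScenarioGraph(period=10)
graph.add_sequence(1, [(0, 15.0, 0.0), (3, 0.0, 45.0),
                       (5, -15.0, 45.0), (8, 0.0, 0.0)])
graph.add_sequence(2, [(0, -10.0, 90.0), (2, 0.0, 70.0),
                       (5, 10.0, 70.0), (7, 0.0, 90.0)])
graph.link_simultaneous()
print(len(graph.nodes), "nodes,", len(graph.edges), "edges")
```

Because each sequence is periodic, the delays around its cycle sum to the period, which is a convenient consistency check on a stored scenario.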

Figure 2 shows the scenario graph for our folding-four
example described previously in Figure 1. Row (a) shows
the three moving parts of the object and the direction in
which angle and angular velocity are measured. Rows (b)
and (c) show the coordination within and between the
three sequences. For clarity, each event node is labeled
by the time step at which it occurs. The dotted boxes
are for illustrative purposes only. The event nodes in the
leftmost boxes are associated with joint #1, those in the
middle boxes with joint #2, and so on. Row (b) shows
the unidirectional edges indicating sequentiality, and row
(c) shows the bi-directional edges indicating simultaneity.
Each event node shown in (b) is repeated for clarity in
(c). The edges are not labeled with their associated time
delay; it is the difference between source and destination
node times.

We  have  assigned  graph  nodes  to  those  points  in  time 
when  something  significant  is  happening,  namely  a  change 
in  angular  velocity  for  one  of  the  joints.  This  covers  the 
significant  points  for  pure  motion  information  (given  our 
assumptions  above).  But  even  with  simple  stick  figures 
there  are  moments  during  the  motion  which  are  significant 
cues,  but  significant  for  reasons  other  than  motion.  The 
simplest  example  for  stick  figures  is  co-linearity  -  when  the 
joint  angle  is  0  or  180  degrees.  For  richer  images  includ¬ 
ing,  for  example,  color  and  shading,  we  may  expect  many 
more  of  these  non-motion  significant  points.  These  are  also 
events  and  can  be  included  as  new  nodes  in  the  graphical 
representation.  The  difference  is  that  these  new  events  are 
not  labeled  with  angular  velocity  but  with  color,  shading, 
etc.  In  this  paper  we  will  ignore  such  non-motion  events. 

A  problem  with  this  representation  is  that  it  does  not 
allow  for  time-scaling.  However  if  we  re-interpret  the  edge 
labeling  so  that  the  time  differences  between  nodes  are 
not  absolute  but  assumed  to  be  a  multiple  of  some  time 
interval,  and  we  have  some  way  of  determining  the  base 
time  interval,  then  a  single  graph  will  represent  a  scenario 
executed  at  any  speed.  This  is  an  avenue  to  be  explored 
in  the  future. 

The graphical representation developed above is naturally
implemented as a connectionist network². Each graph
node  is  represented  by  a  unit,  and  each  directed  labeled 
edge  by  a  link  with  an  associated  time-delay.  The  units 
have  a  site  for  priming  activation  which  arrives  along  these 
delay  links,  and  another  site  for  input  from  the  lower  lev¬ 
els  of  the  visual  system.  Initially  all  units  receive  a  small 
amount  of  priming  activation.  Units  expect  activation  to 
arrive  at  both  sites  simultaneously.  If  priming  activation 
or  input  activation  arrives,  but  not  both,  then  the  events 
in  the  image  are  not  corresponding  to  the  scenario  repre¬ 
sented.  If  the  image  events  do  indeed  correspond  to  the 
scenario  represented,  then  priming  activation  should  flow 
through  the  network,  building  up  as  it  does  so.  For  peri¬ 
odic  motions  this  activation  should  saturate  quickly.  Fig¬ 
ure  2  is  easily  re-interpretable  as  a  connectionist  network. 
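
The coincidence behaviour of such units can be mimicked with a very small simulation. The Python below is a sketch only: the paper does not give update rules, so the gain, the base priming level, the saturation at 1.0 and the function name `run_sequence` are all assumptions, and a single cyclic sequence stands in for a whole scenario.

```python
def run_sequence(delays, event_times, steps, gain=0.3, base=0.1):
    """Simulate a cyclic chain of units joined by delay links.

    delays[k] is the delay on the link from unit k to unit (k + 1) % n.
    event_times is the set of time steps at which an input event arrives.
    Returns the final activation of each unit.
    """
    n = len(delays)
    activation = [base] * n  # every unit starts with a little priming
    expected = 0             # index of the unit whose event should come next
    primed_at = 0            # time step at which priming reaches that unit
    for t in range(steps):
        has_input = t in event_times
        has_priming = (t == primed_at)
        if has_input and has_priming:
            # Both sites active together: strengthen and pass priming on.
            activation[expected] = min(1.0, activation[expected] + gain)
            primed_at = t + delays[expected]
            expected = (expected + 1) % n
        elif has_input or has_priming:
            # Only one site active: the image does not fit this scenario.
            activation[expected] = max(0.0, activation[expected] - gain)
    return activation


delays = [3, 2, 3, 2]
matching = {t + 10 * k for k in range(4) for t in (0, 3, 5, 8)}
mismatched = set(range(0, 40, 2))
good = run_sequence(delays, matching, steps=40)
bad = run_sequence(delays, mismatched, steps=40)
print(good, bad)
```

With input events that follow the stored delays, activation builds up around the cycle and saturates; with events at the wrong times it stays low.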

A  scenario  is  recognized  when  activation  flows  around 
the  network  representing  it.  It  is  a  simple  matter  to  attach 
a  network  to  the  scenario  network  to  detect  when  and  how 
strongly activation is flowing through the scenario network.
The output of this evaluation network is a measure of how
similar  the  input  is  to  this  particular  scenario.  The  details 
of  this  evaluation  network  are  not  given  here,  but  would 
be  important  in  any  quantitative  study. 


2 In our networks units have one or more sites at which links arrive,
and where input activation is processed. This enables differential
treatment of inputs.


D.  Indexing 

Assuming  the  input  described  above,  i.e.  at  each  time  step, 
for  each  joint  in  the  image,  a  readout  of  the  angle  and  an¬ 
gular  velocity  at  that  joint,  how  do  we  index  into  scenario 
memory?  We  would  like  the  time-complexity  of  the  index¬ 
ing  algorithm  to  be  independent  of  the  number  of  scenarios 
stored  in  memory.  However  the  time-complexity  may  be 
dependent  on  the  complexity  of  the  scenarios,  where  com¬ 
plexity  is  loosely  described  by  the  number  of  body  parts 
in  motion  and  the  number  of  events  the  parts  transit.  We 
would  like  the  indexing  algorithm  to  be  tolerant  of  missing 
data  points  (for  example,  due  to  occlusion),  and  to  in¬ 
crementally  converge  on  the  correct  scenario  as  more  and 
more  data  arrives.  At  the  same  time  we  must  avoid  expo¬ 
nential  growth  in  the  number  of  units  and  links  required 
as  the  number  of  scenario  memories  increases.  We  would 
also  like  to  be  able  to  take  advantage  of  evidence  based 
on  structural  or  other  static  qualities  of  the  object  if  it  is 
available. 

Using  units  to  detect  changes  in  angular  velocity,  our 
input  may  be  easily  discretized  into  a  set  of  sequences  of 
events  at  which  angular  velocities  change  for  each  joint. 
These  sequences  are  exactly  analogous  to  the  sequences 
of  events  represented  by  the  nodes  in  the  scenario  graph. 
For  a  given  scenario  we  must  match  the  input  sequences 
against  the  stored  sequences  to  determine  which  input  se¬ 
quence  corresponds  to  which  stored  joint  sequence.  Not 
only must a mapping from input to scenario be established,
the  co-ordination  between  input  sequences  must  match  the 
co-ordination  between  joint  sequences  in  the  scenario.  We 
must  perform  this  match  for  each  scenario  memory.  If  we 
assume  a  solution  to  the  first  problem  (matching  a  par¬ 
ticular  scenario  against  the  input),  then  we  can  achieve 
recognition  time  independent  of  the  number  of  scenarios 
stored  in  memory  at  a  cost  of  linear  increase  in  the  num¬ 
ber  of  units  and  links:  we  match  against  all  scenario  mem¬ 
ories  in  parallel.  This  is  trivial  to  do  in  a  connectionist 
network;  we  simply  duplicate  the  matching  machinery  for 
each  scenario. 
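
The discretization step mentioned at the start of this section, turning a continuous readout of joint angle into a sequence of events, can be sketched as follows. The function name, the tolerance, and the use of a simple difference between successive samples as the angular velocity are assumptions of this illustration.

```python
def detect_events(angles, tolerance=1e-6):
    """Return (time step, new angular velocity, angle) at each velocity change.

    angles[t] is the joint angle sampled at time step t. The velocity over
    step t is approximated by the difference angles[t + 1] - angles[t].
    The first sample is always reported, since it starts the first segment.
    """
    events = []
    previous = None
    for t in range(len(angles) - 1):
        velocity = angles[t + 1] - angles[t]
        if previous is None or abs(velocity - previous) > tolerance:
            events.append((t, velocity, angles[t]))
        previous = velocity
    return events


samples = [0, 15, 30, 45, 45, 45, 30, 15, 0, 0, 0]
events = detect_events(samples)
print(events)
```

Each tuple carries the same labels as a scenario graph node (new angular velocity and absolute angle), so input sequences and stored sequences can be compared directly.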

E.  The  Correspondence  Problem 

Solving  the  correspondence  problem  is  harder.  We  can¬ 
not  wait  until  we  have  all  the  data  before  attempting  to 
match.  This  must  also  be  an  incremental  process  over 
time.  Our  approach  is  to  attempt  to  match  all  input  se¬ 
quences  against  all  stored  sequences  in  parallel.  Again  it 
is  trivial  to  achieve  parallel  matching  in  a  connectionist 
network,  if  one  is  willing  to  pay  the  price  in  terms  of  the 
number  of  units  and  links  required.  If  n  is  the  maximum 
number of joint sequences for any one gait, then we will
require n-squared binding networks for each gait in order
to match everything in parallel. Since typically the stick
figures have only a few joints (fewer than 10), this is not too
high a price to pay.

Figure 3 gives a schematic outline of the architecture
we adopt to solve the correspondence problem. This diagram
shows our example abstract scenario in memory in
the left hand column (the dotted boxes correspond to those
in Figure 2). The input modules are shown in the top row.
The n by n grid of binding networks for matching is shown
in the middle. For each scenario there is a separate scenario
memory network and another set of binding networks. The
input modules are not duplicated, the single set providing
input for all binding networks.

Figure 3: indexing architecture

The  input  is  a  time-varying  pattern  of  activation  over 
the  set  of  input  modules.  There  is  one  input  module  for 
each  joint  in  the  scene  (see  Figure  4  for  details).  Each  input 
module  consists  of  a  set  of  angle  and  a  set  of  angular  veloc¬ 
ity  units.  The  angle  is  that  between  the  major  and  minor 
parts  at  the  joint,  and  the  angular  velocity  its  derivative 
with  respect  to  time.  Thus  the  time-varying  pattern  at 
an  input  module  should  match  the  expected  pattern  rep¬ 
resented  in  one  of  the  scenario  sequences  (#1,  #2  or  #3  in 
the folding-four example). The scenario closest to the motion
in the scene is found by matching each input module
against each scenario sequence. Not only must each active
input module match a sequence, all active input modules
must match different sequences in the same scenario,
and  the  co-ordination  between  events  in  the  different  input 
modules  must  match  the  co-ordination  between  sequences 
in  the  scenario. 

Keeping  in  mind  that  our  correspondence  problem  is 
to  match  input  modules  against  scenario  sequences,  to  per¬ 
form  the  comparison  in  parallel  we  use  the  grid  of  bind¬ 
ing  networks,  shown  in  Figure  3.  The  input  modules  pass 
events  to  the  binding  networks,  which  pass  them  on  with 
varying  degrees  of  activation  to  the  scenario  sequences.  At 
the  same  time  the  binding  networks  are  comparing  the 
activation  from  the  input  modules  with  the  activation  in 
the  scenario  sequences.  Any  particular  binding  network  is 
connected  to  one  input  module  and  one  scenario  sequence. 
If the activations from both correspond (i.e. arrive at the
same time), then the strength of binding increases; otherwise
it decreases. If the input scene corresponds to the
scenario, then over time a consistent set of binding networks
will become active, all others becoming inactive; and
the scenario itself will become very active. If the match is
partial, parts of one or more sets of consistent binding networks
will become active to some degree; and the scenario
will become active to some degree.

Figure 4: binding network details

Figure  4  shows  the  details  for  one  binding  network. 
At  the  top  is  an  input  module.  It  has  a  set  of  units  tuned 
to  a  particular  angular  range  and  another  set  tuned  to  par¬ 
ticular  angular  velocity.  Unit  E  is  an  event  detecting  unit, 
connected  to  all  the  velocity  tuned  units  in  the  module;  it 
fires  when  the  angular  velocity  changes,  our  definition  of 
an  event.  At  the  left  of  Figure  4  is  a  sequence  network, 
sequence  #1  from  Figure  2.  In  the  middle  is  a  binding 
network. 

How do these binding networks work? A binding network
performs two functions: it passes on input events
from its input module to its sequence network; and it compares
the events occurring in the input module with those
occurring in the sequence network. Events are differentiated
by the angle and the angular velocity at the joint,
and occur when the angular velocity changes. For each
event represented in the sequence network, there is a relay
unit, labeled R in Figure 4, in the binding network. This
unit fires when the corresponding event occurs in the input
module (detected by unit E). For example, unit R1 in
Figure 4 is tuned to fire when the joint angle is 270 degrees
(angle unit input), and the angular velocity changes (unit
E input) to 22.5 degrees per time step (velocity unit input).
It sends activation modulated by the binding level to
the appropriate event unit in the sequence network. The
binding level is the activation of the binding unit in the
network, labeled B in Figure 4. It has a site for each event
represented in the sequence network. A site receives activation
from a network event unit and from the corresponding
relay unit. If a site receives input from the relay unit, it
expects input soon after from the sequence network. Otherwise
there is a mis-match occurring. The binding unit
checks that the sites are receiving activation in this fashion,
and if so increases its activation level. Otherwise its
activation decreases.

All  binding  networks  connected  to  the  same  input 
module  have  their  binding  units  arranged  in  a  mutually 
inhibitory  network  (see  Figure  3).  Similarly  all  binding 
networks  connected  to  the  same  sequence  network  are  ar¬ 
ranged  so  that  their  binding  units  inhibit  each  other.  Thus, 
even  with  locally  ambiguous  input,  so  long  as  globally  the 
input  is  determinate,  the  correct  scenario  should  be  the 
most  highly  activated.  If  evidence  for  matching  is  available 
from  other  sources,  for  instance  form  or  color  matching,  it 
can  be  used  to  influence  the  scenario  match  by  providing 
input  to  the  binding  units  in  the  binding  networks. 
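
The interplay between coincidence detection and mutual inhibition can be
illustrated with a small sketch. This is not the Rochester Connectionist
Simulator implementation: the update rule, the function names
`update_bindings` and `run`, and the parameters `gain`, `decay` and
`inhibition` are all assumptions chosen only to show how a consistent set
of bindings can win out over its rivals.

```python
def update_bindings(strength, coincident, gain=0.2, decay=0.1, inhibition=0.05):
    """One update step for an n-by-n grid of binding levels.

    strength[i][j]   -- binding level between input module i and sequence j
    coincident[i][j] -- True if an event from module i arrived together with
                        the matching event of sequence j on this step
    All numeric parameters are hypothetical illustration values.
    """
    n = len(strength)
    row_sum = [sum(strength[i]) for i in range(n)]
    col_sum = [sum(strength[i][j] for i in range(n)) for j in range(n)]
    new = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = strength[i][j]
            # Coincident events raise the binding level; mismatches lower it.
            s += gain if coincident[i][j] else -decay
            # Rivals share the same input module (row) or sequence (column).
            rivals = (row_sum[i] - strength[i][j]) + (col_sum[j] - strength[i][j])
            s -= inhibition * rivals
            new[i][j] = min(1.0, max(0.0, s))  # keep levels in [0, 1]
    return new


def run(coincident, steps=20, n=3):
    """Iterate the update from a uniform, undecided starting state."""
    strength = [[0.5] * n for _ in range(n)]
    for _ in range(steps):
        strength = update_bindings(strength, coincident)
    return strength
```

With events that consistently pair module i with sequence i, the diagonal
bindings saturate and all competing bindings decay to zero, which mirrors
the behaviour described for a fully matching scenario.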

F.  Preliminary  Results 

The  architecture  has  been  implemented  using  the 
Rochester Connectionist Simulator [Goddard et al., 1988]
for  the  single  scenario  described  above.  The  results  depend 
on  the  actual  parameters  used  in  the  activation  functions 
and  in  the  time-delayed  links.  As  expected,  presenting  per¬ 
fect  input  causes  the  scenario  network  to  saturate  quickly 
(within one cycle of the motion). Presenting imperfect
input, e.g. with one of the input modules inactivated to
simulate occlusion, causes the scenario network to activate
more  slowly.  Overall  it  is  clear  that  the  architecture  solves 
the  problem,  and  moreover  that  it  can  be  tuned  along  sev¬ 
eral  dimensions:  speed,  sensitivity  to  missing  data,  sensi¬ 
tivity  to  incorrect  data.  Exactly  how  the  network  should 
be  tuned  is  a  matter  for  further  research,  and  will  require 
psychophysical  experiments. 

IV.  Related  Work 

Much  work  has  been  done  on  static  object  recognition 
([Lowe,  1985]),  and  on  the  use  of  motion  information  to 
reconstruct object information [Ullman, 1979] that could
be  used  for  recognition.  Yet  surprisingly  little  attention 
has  been  paid  to  the  use  of  motion  information  directly  in 
recognition. 

A.  MLD  Computation 

[Rashid, 1980] shows how the trajectory information used
as input for the model presented here can be computed
from real MLD image sequences, but does not seriously
address the problems of matching this information against
a database of models, the focus of this paper. [Hoffman
and Flinchbaugh, 1982] show that with the assumption of
planar motion, two “structure from planar motion” propositions
may be proved, but again the indexing problem is
not addressed.


sentation  for  temporal  sequence  in  massively  parallel  net¬ 
works.  His  focus  is  the  representation  of  temporal  con¬ 
straints,  using  single  units  to  represent  events  and  con¬ 
straint  units  to  mediate  the  links.  The  model  has  been 
successfully  used  for  word  recognition,  and  is  similar  to 
the  one  developed  in  this  paper. 

[Kleinfeld, 1987] and [Sompolinsky and Kanter, 1986]
have  developed  sequence  storage  and  retrieval  models  using 
spin-glass  type  networks.  This  type  of  network,  after  ini¬ 
tialization  to  some  state,  evolves  through  stable  attractor 
states  at  each  iteration,  the  states  forming  the  sequence. 
The  temporal  qualities  are  achieved  through  links  with  as¬ 
sociated  delays.  These  models  are  extensions  of  the  Hop- 
field  type  [Hopfield,  1982],  and  rely  on  two  kinds  of  connec¬ 
tions:  symmetric  links  to  encode  attractor  states,  and  slow 
asymmetric  links  to  encode  transitions  between  states. 
Sequence  recognition  in  [Kleinfeld,  1987]  is  achieved  by 
adding  input  units  with  connections  to  the  memory  units. 
Activation  from  these  input  units  is  required  to  complete 
transitions  from  one  state  to  the  next. 

V.  Conclusions 

We  have  introduced  a  representation  for  articulated  stick 
figure  motion  that  is  naturally  implemented  in  a  massively 
parallel  network.  A  network  architecture  for  indexing  into 
this  memory  representation  from  biologically  plausible  in¬ 
put  has  been  designed.  The  results  of  the  preliminary  im¬ 
plementation  and  tests  are  encouraging. 

Our  immediate  task  is  to  generate  more  abstract  sce¬ 
narios  and  test  the  ability  of  the  architecture  to  discrim¬ 
inate  between  scenarios.  Further  ahead  it  will  be  neces¬ 
sary  to  design  a  way  of  integrating  indexing  into  scenario 
memory  with  indexing  into  static  form  memory.  We  are 
considering  conducting  informal  perceptual  psychology  ex¬ 
periments  to  obtain  more  insight  into  human  processing 
capabilities. 
Abstract

In dynamic imaging situations where the sensor is undergoing
primarily translational motion with a relatively
small rotational component, it might seem likely that approximate
translational motion algorithms can be effective
in determining depth. By restricting the processing to the
two dimensions of translational motion there is a great reduction
in complexity from the five dimensions of general
motion (excluding the scaling component of sensor velocity
or displacement). In an attempt to recover depth of
points over a sequence of frames in such situations a set of
three algorithms was applied sequentially: elimination of
the effect of small rotations, then extraction of the FOE,
and then determination of depth of points. The goal was
the extraction of obstacles at a distance beyond the accuracy
of the range sensors in the ALV research program.

In this paper we show, however, that even small rotations
can significantly affect the performance of algorithms
designed for pure translational motion. In addition, a theoretical
analysis of the problems in computing the FOE is
presented. The attempt to correct for small rotations in
the motion system described above was ineffective. Accurate
extraction of the FOE and recovery of depth without
dealing with general motion was not achieved.

More recently, a pair of previously developed algorithms
have been combined to yield a general motion algorithm,
and have been applied to sequences of approximate
translational motion. By determining the small rotational
components of motion, more promising depth results have
been extracted under the same circumstances described
above. The conclusion is that in order to determine depth
of points in many real situations, general motion analysis
must be applied even when sensor motion is primarily
translational. Various alternatives for extracting depth
from motion are briefly considered.

1 Introduction

1.1 Background - Motion Algorithms
Developed at the University of Massachusetts


At  the  University  of  Massachusetts  several  algorithms 
have  been  developed  and  applied  to  real  scenes  in  an  at¬ 
tempt  to  develop  practical  and  robust  techniques  for  mo¬ 
bile  vehicle  applications.  The  goals  have  been  the  recovery 
of  sensor  motion  parameters  when  they  are  unknown,  the 
determination  of  vehicle  location  via  landmarks  in  known 
environments,  and  the  recovery  of  the  depth  of  environ¬ 
mental  points,  both  for  the  derivation  of  mechanisms  for 
obstacle  avoidance,  and  for  object  recognition. 

In  cases  of  pure  translational  motion  of  the  sensor, 
Lawton  [1,2]  developed  reasonably  robust  techniques  for 
recovery  of  the  sensor  motion  parameters  (or  equivalently 
the  focus  of  expansion  -  FOE);  these  techniques  for  FOE 
search  were  extended  by  Pavlin  [3,4].  On  several  synthetic 
image  sequences,  the  FOE  was  recovered  within  1-2  de¬ 
grees  (about  5-10  pixels)  for  axes  inside  a  45  degree  cone 
around  the  line  of  sight  and  within  5-10  degrees  outside 
that range [3]. The approach has also been successfully applied to
several  natural  scenes. 

In  another  investigation  the  goal  was  to  recover  depth 
from  known  motion.  Since  the  exact  motion  parame¬ 
ters  could  never  be  known,  some  degree  of  imprecision 
should be included. Snyder [5] has provided an analysis
of the effects of uncertainty in the location of the
FOE and the tracked feature points on the determination
of depth, and on the location and size of the search
window in future frames, as a function of the uncertainties
in the FOE, the feature points, and the corresponding
environmental depths. Bharwani [6] applied these
ideas using the time-adjacency relationship in an iterative
spatio-temporal strategy of prediction-refinement of feature
point displacements and their associated depths. The
algorithm was applied to a sequence of hand registered
real-world images, with either an FOE that was hand-supplied
or extracted from the Lawton-Pavlin algorithm.
In several cases this algorithm achieved useful depth values
over a sequence of images [6,7]. Thus, the methods of
FOE determination and depth extraction, used together,
showed strong promise for practical vehicle navigation in
outdoor environments.

In addition to the above a two stage algorithm for
analyzing general motion has also been implemented. The
first stage is based on Anandan’s hierarchical extraction of
displacement vectors [8,9,10]. The second stage is based
on Adiv’s algorithm [11,12] for analyzing general motion
in the presence of multiple independently moving objects.

1.2 The Problem - Obstacle Avoidance
via Motion of an Approximately
Translating Mobile Vehicle

The algorithms evaluated in this paper are part of the
research effort directed towards the development of a system
for obstacle avoidance via motion analysis as part of
the Autonomous Land Vehicle (ALV) project. The primary
goal of the ALV project is to design a vehicle which
can autonomously navigate through a wide range of environmental
situations. For obstacle avoidance the depth
of objects out to about 80 feet needs to be extracted to
determine whether they may be obstacles. The phase-difference
based ERIM scanning laser range finder [13]
seems to be limited to distances of less than 40 feet. Thus
our specific research goal was to extract depth at distances
greater than the effective range of the ERIM sensor, or approximately
in the 40-80 foot range.

In an effort to provide more powerful constraints that
would enhance the effectiveness of such a practical application,
we decided to consider the case of depth from
known translational motion. Since precisely known sensor
motion is not possible without sophisticated instrumentation,
the problem was modified to one of approximate
translational motion.

However, in actual application of the methods to the
Carnegie-Mellon University NAVLAB [13] moving down
a path that is approximately horizontal and planar, certain
general problems occur. The vehicle may bounce,
sway, vibrate, and of course turn slightly while still primarily
undergoing translation, and the road while approximately
planar may vary with minor undulations, pits, and
bumps. Examining sequences of images taken at 2 or 4 foot
intervals makes clear that there is significant “misregistration”
between frames beyond the
image displacements expected from smooth translational
sensor motion. The expected translational displacement
(i.e. the translation due to the average, or smoothed motion)
between any pair of frames can be significantly affected
by perturbations which involve both a translational
and rotational component. This could result in unpredictable
movement of the “FOE”¹ over the sequence of
frames (i.e. there might be a varying direction of translational
motion which if “averaged” across frames would
be the “approximate” translational motion), as well as a
varying rotational component that causes additional rotational
displacements of image points independently of
the depths of the associated environmental points. Thus,
even though the vehicle is translating in some fixed direction
that remains approximately constant across many
frames, there can be significant local variations between
nearby pairs of frames in the sequence. In order to obtain
accurate depths from translational motion, it may be necessary
to update the FOE location and remove the effect
of rotational displacements.

¹ If there is rotation, there is of course no FOE; nevertheless, the
translation vector will intersect the image plane somewhere and sometimes
we will use the term FOE in this manner.

1.3 An Attempt at Building a Translational
Motion Subsystem

As a consequence of the effects described in the previous
section, the two translational motion algorithms for FOE
recovery [1,3] and extraction of depth [6] mentioned earlier
cannot be applied directly. Either some form of preprocessing
is necessary to allow translational processing to
be maintained despite the problems described, or else the
complexities of general motion must be considered. An
attempt was made to carry out “registration” of the images
[14], i.e., removal of small rotations by translation
of the image, as a simple preprocessing step to allow
effective translational motion analysis. The three algorithms
shown in Figure 1 were to be applied sequentially
to determine the depth of environmental points. The first
algorithm was intended to eliminate a significant portion
of small rotations by image registration [14]. The second
algorithm was to compute the position of the FOE for
the registered images [2,4]. The third algorithm was to
use the time-adjacency relationship and an iterative refinement
procedure to compute depth [6].

Figure 1: Processing of Approximate Translational Motion
for Depth Extraction

1.4  Analysis  of  the  Problems 

In  view  of  all  the  difficulties  with  the  estimation  of  mo¬ 
tion  parameters,  the  difficulty  of  autonomous  navigation 
through  unknown  terrain  is  not  surprising.  Our  restric¬ 
tion  of  the  problem  to  finding  the  depth  of  points  from 
approximately  known  translational  motion  was  intended 
to  introduce  sufficient  constraints  to  overcome  these  prob¬ 
lems.  It  is  the  purpose  of  this  paper  to  draw  attention  to 
the  practical  difficulty  of  developing  such  a  system.  While 
each  of  the  algorithms  in  isolation  had  some  experimen¬ 
tal  success  in  achieving  its  intended  goal,  they  failed  as 
an  integrated  set  of  algorithms.  A  global  system  is  much 
more  difficult  to  construct  because  of  cumulative  errors 
generated  by  the  subsystems.  In  this  paper  we  show  that 
even  small  amounts  of  rotation  can  cause  large  errors  in 
determining  the  FOE  and  depth  using  translational  mo¬ 
tion  algorithms.  Simple  rotation-elimination  mechanisms, 
such  as  the  one  used  here,  often  place  too  many  restric¬ 
tions  on  the  imaging  situations  for  practical  use. 

The  effect  of  errors  on  the  estimation  of  depth  and 
motion  parameters  has  been  studied  by  many  authors 
[5,15,16,17,18]. Tsai and Huang [18] demonstrated experimentally
that “unless the error in finding point correspondences
is less than 3%” the solutions to the motion
parameters  are  “overwhelmed  by  noise.”  They  also  assert 
on  the  basis  of  experimental  evidence  that  “error  of  the 
estimated  motion  and  geometrical  parameters  can  be  re¬ 
duced  by  using  more  point  correspondences  only  if  the 
error for the latter is less than 3%.” Fang and Huang [17]
showed  that  as  the  distance  between  the  object  and  the 
image  plane  increases  it  becomes  more  difficult  to  solve 
the  motion  equations  proposed  by  them.  It  must  be  noted 
that  these  studies  are  by  no  means  general  and  essentially 
apply  to  the  algorithms  proposed  or  studied  by  the  re¬ 
searcher.  However,  they  provide  us  with  insight  regarding 
the  difficulty  of  motion  parameter  and  depth  estimation. 

As a consequence of the difficulty of recovering the
depth of points in practical situations of approximate translational
motion, we return later in this paper to an experimental
investigation of the case of general motion. We
consider a two stage algorithm for analyzing general motion:
Anandan’s hierarchical extraction of displacement
vectors [8,9,10] followed by Adiv’s algorithm [11,12] for
analysis of general motion in the presence of independently
moving objects. Initial results show the combination
of these two algorithms to provide reasonably good
estimates for the depth of environmental points.

2  Depth  From  Approximate 
Translational  Motion 

Recovery of depth from approximately known translational
motion has been studied by Bharwani et al. [6].
This  scheme  is  based  on  the  time-adjacency  relation  and 
involves  a  prediction-based  iterative  refinement  scheme  for 
computing  increasingly  more  accurate  depth  maps.  Since 
it  is  based  on  the  time-adjacency  relation  (see  equation 
(1)),  the  position  of  the  FOE  must  be  known  before  the 
algorithm  can  be  applied.  In  the  next  section,  we  give  a 
brief  discussion  of  the  various  components  of  the  approach 
as  developed  at  the  University  of  Massachusetts. 

2.1  FOE  determination 

The basic algorithm for determining the FOE was developed
by Lawton [2]. The position of the FOE is determined
in essentially two phases. In the first phase distinctive
features are extracted under the assumption that they
can be more easily tracked across frames. In the second
phase the direction of translation is found in the following
manner. Initially a position of the FOE is hypothesized
in the image plane. Selection of an FOE constrains each
feature point to appear in the second frame on the radial
line emanating from the hypothesized FOE and passing
through the corresponding feature point in the first frame.
The point on the path that correlates best (up to some
resolution of correlation matching) with the feature point
in the first frame will be assumed to be the feature correspondence,
and the deviation from perfect correlation is
regarded as an error. The sum of these errors for the set
of feature points gives the total error for the hypothesized
FOE. The objective is to find the FOE position for which
the total error is minimized. A search algorithm involving
a coarse sampling of the FOE followed by local hill climbing
was demonstrated to be effective on several natural
scenes.
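
The search strategy described above, coarse sampling followed by local hill
climbing, can be sketched as follows. This is a minimal illustration and not
Lawton's implementation. In place of correlation along the radial line it
uses a purely geometric error, assuming the features are already matched:
the squared distance of each second-frame point from the radial line through
the hypothesized FOE. The function names, the search bounds and the step
sizes are hypothetical.

```python
import math


def radial_error(foe, matches):
    """Total squared distance of second-frame points from the radial lines
    defined by a hypothesized FOE and the first-frame points."""
    fx, fy = foe
    total = 0.0
    for (px, py), (qx, qy) in matches:
        dx, dy = px - fx, py - fy
        norm = math.hypot(dx, dy)
        if norm == 0.0:
            continue  # a feature sitting on the FOE defines no radial line
        # |cross product| / |direction| is the distance of q from the line.
        dist = abs(dx * (qy - fy) - dy * (qx - fx)) / norm
        total += dist * dist
    return total


def find_foe(matches, lo=-200.0, hi=200.0, coarse=50.0, fine=1.0, max_iters=1000):
    """Coarse grid sampling of FOE hypotheses followed by local hill climbing."""
    steps = int((hi - lo) / coarse) + 1
    candidates = [(lo + i * coarse, lo + j * coarse)
                  for i in range(steps) for j in range(steps)]
    foe = min(candidates, key=lambda c: radial_error(c, matches))
    err = radial_error(foe, matches)

    step = coarse / 2.0
    for _ in range(max_iters):  # cap the iterations so the climb always ends
        if step < fine:
            break
        improved = False
        for ox, oy in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand = (foe[0] + ox, foe[1] + oy)
            cand_err = radial_error(cand, matches)
            if cand_err < err:
                foe, err, improved = cand, cand_err, True
        if not improved:
            step /= 2.0  # refine the step once no neighbour is better
    return foe, err
```

As the text notes for the real algorithm, the coarse step cannot be made too
large: the hill climbing only refines the best coarse sample, so a poor
coarse minimum leads to a poor final answer.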

In Pavlin et al.’s modification of Lawton’s algorithm,
the orientation of the axis of translation and associated
error values are represented in polar coordinates on the
unit sphere centered at the camera focus. This gives a
more uniform sampling of the hypothesized FOE positions
than working with the image plane. The error surface is
constructed by fitting a smooth surface over the values
produced by a coarse sampling over polar angles. The
sampling is then repeated locally around the minimum
of the coarsely sampled error surface. This hierarchical
search along with smoothing was intended to make the
algorithm robust and more efficient. The smoothing was intended to
eliminate fluctuations in the error surface and reduce the
need for finer sampling resolution, thereby reducing the
required computation.
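
A small sketch may help show what sampling the translation axis on the unit
sphere means in practice. It assumes the standard relation that a translation
axis (U, V, W) has its FOE at (fU/W, fV/W) in the image. The function names,
the angular parameterization (polar angle measured from the line of sight)
and the sampling steps are illustrative assumptions, not the parameters used
by Pavlin et al.

```python
import math


def axis_from_angles(theta, phi):
    """Unit translation axis for polar angle theta (measured from the line
    of sight) and azimuth phi, both in radians."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def foe_from_axis(axis, f):
    """Image-plane FOE of a translation axis, or None if it lies at infinity."""
    U, V, W = axis
    if abs(W) < 1e-12:
        return None  # axis parallel to the image plane
    return (f * U / W, f * V / W)


def sample_axes(theta_step, phi_step, theta_max=math.pi / 2):
    """Evenly spaced (theta, phi) hypotheses on the forward hemisphere."""
    samples = [(0.0, 0.0)]  # the line of sight itself
    n_theta = int(round(theta_max / theta_step))
    n_phi = int(round(2 * math.pi / phi_step))
    for i in range(1, n_theta + 1):
        for j in range(n_phi):
            samples.append((i * theta_step, j * phi_step))
    return samples
```

Equal steps in theta map to image-plane FOE positions that spread out rapidly
as theta grows (the radius is f tan(theta)), which is why a uniform grid in
the image plane samples directions away from the line of sight so unevenly.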

Experimental  results  have  shown  that  this  algorithm 
is  effective  on  real  image  data  if 

1.  the  motion  is  purely  translational,  and 

2. approximately 8 to 16 distinctive feature points can
be found which can be reasonably tracked across
frames.

However,  for  reliable  results  when  no  a  priori  infor¬ 
mation  is  available  to  place  constraints  on  the  processing 
(e.g.  approximate  location  of  the  FOE)  the  algorithm  is 
quite  time  consuming.  In  addition  various  design  param¬ 
eters  must  be  adjusted  to  account  for  the  worst  eventual¬ 
ities  (e.g.,  the  initial  FOE  sampling  cannot  be  made  too 
coarse because it might result in the selection of an incorrect
global minimum for the FOE). Nevertheless, this
algorithm appears to be quite useful for images in many
practical  situations. 

2.2  Depth  Determination 

The discussion here describes the algorithm proposed and
implemented by Bharwani et al. [6]. Consider the time-adjacency
[19] relation:

$$\frac{D(t)}{d(t)} = \frac{Z(t)}{W(t)} \tag{1}$$

where

d(t) : the radial velocity (displacement) in pixels of a
point P in the image

D(t) : the distance of the point P from the FOE in
pixels

W(t) : the Z-component of the actual velocity
(displacement) of the corresponding scene point

Z(t) : the depth of the corresponding scene point from
the camera.

Hence, depth Z can be found via pure translational
motion if we know W, the FOE position, and the
matches across frames.
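
As a worked illustration of the time-adjacency relation, the sketch below
solves D/d = Z/W for Z given a matched point pair, the FOE and the forward
displacement W. The function name and the tuple-based point representation
are assumptions made for the example; the radial displacement d is taken as
the change in distance from the FOE between the two frames.

```python
import math


def depth_from_time_adjacency(point, matched_point, foe, W):
    """Depth Z = W * D / d for one tracked feature.

    point, matched_point -- (x, y) pixel positions in consecutive frames
    foe                  -- (x, y) pixel position of the focus of expansion
    W                    -- forward displacement of the sensor between frames;
                            Z is returned in the same units as W
    """
    D = math.hypot(point[0] - foe[0], point[1] - foe[1])
    D_next = math.hypot(matched_point[0] - foe[0], matched_point[1] - foe[1])
    d = D_next - D  # radial displacement away from the FOE, in pixels
    if d <= 0.0:
        raise ValueError("point must move away from the FOE for forward motion")
    return W * D / d
```

For example, a point 50 pixels from the FOE that moves 5 pixels outward while
the sensor advances 2 feet lies at a depth of 2 * 50 / 5 = 20 feet. The
result is sensitive to errors in d, which is why small unmodelled rotations
matter so much for distant points with small displacements.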

Essentially  the  technique  can  be  described  as  itera¬ 
tively  improving  the  precision  of  depth  estimates  over  a 
sequence  of  frames  by  using  the  depth  estimates  (along 
with  their  associated  uncertainties)  from  the  previous  time 
steps. In future frames, the match region can be predicted
better if the following are known:

(a) the current estimate of the depth of a point

(b) the uncertainty associated with the depth.

Therefore, the method matches points between frames at
a particular match resolution, finds depth and uses this
to predict a smaller match window to be searched with
a finer match resolution. Increasing the match resolution
was intended to give a more accurate depth. In addition,
the method can be guaranteed to have a constant upper
bound  on  computation  for  processing  between  frames  by 
controlling  the  correlation  resolution  (via  interpolation) 
of  the  feature  point  matchings  as  an  inverse  function  of 
the  size  of  the  predicted  search  window.  This  makes  it  a 
simple  scheme  for  depth  determination  with  a  sequence  of 
images  for  translational  egomotion. 
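
The prediction step can be sketched directly from the time-adjacency
relation d = W D / Z. The code below is an illustrative assumption, not the
procedure of Bharwani et al.: it treats the depth uncertainty as a simple
symmetric interval, takes W and the FOE as exact, and uses a hypothetical
fixed budget of `samples_per_window` correlation samples to show how the
match resolution can be tied inversely to the window size.

```python
def predicted_displacement_window(D, W, Z, sigma):
    """Range of radial displacements (pixels) expected in the next frame.

    D     -- current distance of the point from the FOE, in pixels
    W     -- forward displacement of the sensor to the next frame
    Z     -- current depth estimate (same units as W)
    sigma -- half-width of the depth uncertainty interval
    """
    if sigma < 0.0 or Z - sigma <= 0.0:
        raise ValueError("depth interval must be positive")
    d_min = W * D / (Z + sigma)  # farthest plausible depth, smallest motion
    d_max = W * D / (Z - sigma)  # nearest plausible depth, largest motion
    return d_min, d_max


def match_resolution(window_length, samples_per_window=16):
    """Sample spacing (pixels) that keeps the matching cost constant."""
    return window_length / samples_per_window
```

As the depth uncertainty shrinks over the sequence, the window shrinks with
it, and the same number of samples then gives a finer (sub-pixel) match
resolution at no extra cost per frame.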

Even when the motion of the camera is known, ambiguities
in the matching process between frames lead
to erroneous depth determination. The matching process
might fail because of occlusion, highlights, shadows, distortion
of the surface over time, similarity of image
features, etc. In addition to these problems, an insufficient
search area for the match along the displacement
path  can  result  in  no  matches  or  incorrect  matches  being 
found.  The  matching  problem  is  a  very  widely  studied 
area  in  motion  analysis  [9,20,21]  and  will  not  be  discussed 
here. 

The  major  problems  with  the  depth  determination 
algorithm  are: 

1.  The  correspondence  problem  (i.e.  effectiveness  of  the 
match  process); 

2.  The  accuracy  of  FOE  determination; 

3. The effect of small rotations.

A  discussion  of  the  problems  of  FOE  determination 
and  effect  of  small  rotations  is  given  in  the  next  two  sec¬ 
tions  and  is  a  key  focus  of  this  paper. 

3  Problems  of  FOE  determina¬ 
tion 

3.1 The Image Model

We present here the equations describing displacement
fields which will be used throughout the paper. Let

• (X,Y,Z) be a Cartesian coordinate system with its
origin at the nodal point of the camera,

• (x,y) represent the corresponding coordinate system
of a planar image,

• f be the focal length of the camera,

• F_i be the planar image of the environment at time
t_i,

• F_{i+1} be the planar image of the environment at time
t_{i+1}.

An environmental point P(X,Y,Z) then appears at
the point p(x,y) in the image F. If x and y are given in
pixels, then

$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}$$

where f is the focal length of the camera in pixels, given
by

$$f = \frac{N}{2} \cot\frac{FOV}{2} \tag{2}$$

Here FOV is the field of view of the camera, and the image
resolution is N x N.
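
The image model can be made concrete with a short sketch, assuming the usual
pinhole geometry in which half the image width N/2 subtends half the field of
view, so that f = (N/2) cot(FOV/2). The function names are hypothetical, and
the field of view is passed in degrees purely for convenience.

```python
import math


def focal_length_pixels(N, fov_degrees):
    """Focal length in pixels for an N x N image with the given field of view."""
    half_fov = math.radians(fov_degrees) / 2.0
    return (N / 2.0) / math.tan(half_fov)  # (N/2) * cot(FOV/2)


def project(X, Y, Z, f):
    """Perspective projection of the scene point (X, Y, Z) to pixels (x, y)."""
    if Z <= 0.0:
        raise ValueError("point must lie in front of the camera")
    return (f * X / Z, f * Y / Z)
```

For a 256 x 256 image and a 90 degree field of view this gives f = 128
pixels, so a point at (1, 2, 4) projects to (32, 64).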

Let the motion relative to the camera of a point in
the rigid environment have translational component T =
(U, V, W) and rotational component R = (A, B, C). It can
be shown that if

1. W/Z is much smaller than 1 (i.e., the translation of
the point in the Z direction is much smaller than the
depth of the point);

2. the rotation parameters are small; and

3. the field of view of the camera is not very large;

then the displacement vector (u,v) of the image point
between frames F_i and F_{i+1} is given by

$$u \approx \frac{fU - xW}{Z} + Bf - Cy, \tag{3}$$

$$v \approx \frac{fV - yW}{Z} - Af + Cx. \tag{4}$$

We note that u and v both have translational and rotational
components:

$$(u,v) = (u_t,v_t) + (u_r,v_r) \tag{5}$$
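
The split of the displacement into translational and rotational parts can be
illustrated numerically. The sketch assumes the first-order model
u ≈ (fU - xW)/Z + Bf - Cy and v ≈ (fV - yW)/Z - Af + Cx, with the v
component obtained by the same expansion as u; the function name and the
numerical values in the usage note are hypothetical.

```python
def displacement(x, y, Z, T, R, f):
    """First-order image displacement (u, v) at pixel (x, y).

    T = (U, V, W) -- translation of the point relative to the camera
    R = (A, B, C) -- small rotation about the X, Y and Z axes (radians)
    Z             -- depth of the scene point; f -- focal length in pixels
    """
    U, V, W = T
    A, B, C = R
    # Translational part: depends on depth and vanishes at the FOE.
    u_t = (f * U - x * W) / Z
    v_t = (f * V - y * W) / Z
    # Rotational part: independent of depth.
    u_r = B * f - C * y
    v_r = -A * f + C * x
    return (u_t + u_r, v_t + v_r)
```

With T = (1, 2, 4) and f = 100 the translational flow is zero at the FOE
(25, 50) for any depth. Adding a rotation of only B = 0.01 radians shifts
every point by about one pixel regardless of depth, so the flow no longer
vanishes at the true FOE; for distant points whose translational
displacement is itself about a pixel, this is a large relative error.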
3.2 Overview

The motion field is the projection of the 3-D
velocity/displacement field onto the image plane. Given the
motion field, the motion parameters (U, V, W, A, B, C)
and the scene structure (Z) can be uniquely determined
up to a scale factor, except for a two-way ambiguity in
certain special cases [22].

The optical flow is the image velocity or displacement
recovered from the image. If the optical flow is precisely
equal to the motion field, the motion parameters and scene
structure recovered are the correct ones. In general, the
optical flow is not equal to the motion field. This may
be due to noise, measurement errors (e.g., our
displacements are usually accurate only to within 0.5 pixel),
correspondence error, etc. Even small errors in the optical
flow can produce motion parameters which are far from
the correct ones [11].

In the following subsections we first establish the error
in the FOE position as a function of the error in the image
displacements. By interpreting small rotations as
deviations/errors from the displacements derived from pure
translation, we show that even small rotations may cause
large errors in the FOE position returned by purely
translational methods. As a corollary we show how the different
feature points should be weighted to improve the FOE
computation.


3.3 The Effect of Displacement Errors
on FOE Position

Let us assume that the motion is purely translational.
Then from equation (5) it follows that

$u = u_t = \frac{fU - xW}{Z} = f\frac{U}{Z} - x\frac{W}{Z}$,  (10)

$v = v_t = \frac{fV - yW}{Z} = f\frac{V}{Z} - y\frac{W}{Z}$.  (11)

It is well known that the flow field given by (10) and
(11) is purely radial, with center at the FOE. The position
of the FOE (when it exists) is the point where u = v = 0.
From (10,11) we easily see that this is the point

$(x_0, y_0) = \left(f\frac{U}{W}, f\frac{V}{W}\right)$.  (12)

Suppose now that an error is made in determining the
displacement (u, v), so that we find (u', v'), where

$(u', v') = (u, v) + (a, b)$.  (13)

Thus, substituting (10) and (11) into (13), we find:

$u' = f\frac{U}{Z} - x\frac{W}{Z} + a$,  (14)

$v' = f\frac{V}{Z} - y\frac{W}{Z} + b$.  (15)

For purposes of illustration, let us assume that the scene
being viewed is coplanar at depth Z with the image plane,
and that (a, b) is the same for each image point. We can
combine the coordinate-independent terms in (14) and
(15) to find:

$u' = f\frac{U + \delta U}{Z} - x\frac{W}{Z}$,

$v' = f\frac{V + \delta V}{Z} - y\frac{W}{Z}$,

where the constants $\delta U$ and $\delta V$ are given by

$\delta U = aZ/f, \quad \delta V = bZ/f$.  (16)

The displacement field is therefore of the same form
as (10,11), and hence there is still an FOE in this case,
located (according to (12)) at

$(x_0', y_0') = \left(f\frac{U + \delta U}{W}, f\frac{V + \delta V}{W}\right) = (x_0, y_0) + (\delta x_0, \delta y_0)$,  (17)

where

$(\delta x_0, \delta y_0) = \frac{Z}{W}(a, b)$.  (18)


We will call this the “error in the FOE” due to the error
(a, b) in the image displacement of the point located at
depth Z.
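
As a small numerical check of the FOE error relation, the following Python sketch computes the FOE of a purely translational field from (12), perturbs the translation by the equivalent amounts aZ/f and bZ/f, and compares the resulting shift with (Z/W)(a, b). The function names and all numerical values (focal length, translation, depth, half-pixel error) are hypothetical and chosen only for illustration.

```python
import numpy as np

def foe(f, U, V, W):
    """FOE of a purely translational flow field: (f U / W, f V / W)."""
    return np.array([f * U / W, f * V / W])

def foe_error(a, b, Z, W):
    """FOE shift caused by a constant displacement error (a, b) at depth Z."""
    return (Z / W) * np.array([a, b])

# Hypothetical values: f in pixels, translation and depth in feet.
f, U, V, W, Z = 309.0, 0.35, 1.0, 4.0, 60.0
a, b = 0.5, 0.0  # half-pixel error in x only

x0 = foe(f, U, V, W)
# A constant error (a, b) acts like extra translation (a Z / f, b Z / f).
shifted = foe(f, U + a * Z / f, V + b * Z / f, W)

print("FOE:", x0)
print("shift from perturbed field:", shifted - x0)
print("shift from (Z/W)(a, b):   ", foe_error(a, b, Z, W))
```

With these assumed numbers a half-pixel error at Z/W = 15 moves the FOE by 7.5 pixels, which illustrates how strongly the ratio Z/W amplifies a small displacement error.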

Suppose  now  that  we  allow  the  displacement  errors 
(a,b)  in  (13)  to  vary  with  position  in  the  image  plane, 
and  that  the  environment  is  no  longer  coplanar  with  the 
image  plane.  It  follows  that  the  “errorful”  displacement 
will  no  longer  be  radial,  and  hence  there  will  not,  in  gen¬ 
eral,  be  an  FOE.  However,  we  could  still  apply  our  purely 
translational  algorithm  to  this  flow  field.  How  will  the 
“errorful” flow field affect our algorithm?

This is clearly not a question that can be given a
general answer. In fact, the only result we can quote is
equation (18), which we have proved only in the case of
constant depth and constant error (a, b). We can,
however, conclude one general rule from (18): for fixed,
constant displacement error, distant points (those for which
Z >> W) will contribute a larger error to the position of
the “FOE” than will nearby points (those for which Z and
W are comparable). This is to be expected on intuitive
grounds as well, since (generally speaking) distant points
will have smaller displacements than nearby points, and
so a given displacement error will have a proportionally
larger effect on the displacement of distant points than on
that of nearby points.


This situation is illustrated in Figure 2. In Figure
2(a) the shift in the FOE is smaller than in Figure 2(b)
because the displacement vectors $d_1$ and $d_2$ in Figure 2(a)
are larger than the corresponding displacement vectors in
Figure 2(b). The displacement vectors are larger when
the points are closer. Hence, the larger percentage error
in displacement for distant points returns an FOE position
with greater error.


We note that no assumption about the source of the
displacement error has been made: it could be noise in
the input image, correspondence error, or anything else.
The usual assumption is that the displacement errors are
due to random, uncorrelated noise. This would lead to
displacement errors which are random in both magnitude
and direction, and so on average would not lead to a large
error in the position of the FOE. However, if there is a
systematic error, for instance when the displacement error
is a constant, then the FOE error would be expected to be
given by an expression something like equation (18). In
the next section, we show that under certain not terribly
restrictive conditions rotations can give rise to just such
a systematic error.


Figure 2: Shift in the FOE relative to the size of the
displacement vectors


3.4 Rotations as a Source of Systematic
Error

Table 1: Rotational contributions to optical flow as a function of position. Shows the (constant/linear/quadratic)
contributions (in pixels) to the x-component of the displacement field at various spatial locations

x(0.54/-0.22/0.19)   x(0.54/-0.22/0.07)   x(0.54/-0.22/0)   x(0.54/-0.22/-0.02)   x(0.54/-0.22/0)

x(0.54/-0.11/0.14)   x(0.54/-0.11/0.05)   x(0.54/-0.11/0)   x(0.54/-0.11/0)       x(0.54/-0.11/0.05)

x(0.54/0/0.09)       x(0.54/0/0.02)       x(0.54/0/0)       x(0.54/0/0.02)        x(0.54/0/0.09)

x(0.54/0.11/0.05)    x(0.54/0.11/0)       x(0.54/0.11/0)    x(0.54/0.11/0.05)     x(0.54/0.11/0.14)

x(0.54/0.22/0)       x(0.54/0.22/-0.02)   x(0.54/0.22/0)    x(0.54/0.22/0.07)     x(0.54/0.22/0.19)


When the scene undergoes a general rigid motion, the
displacement field will be given by (5), with both
translational and rotational components. We can, as in
equation (13), consider the rotational components $(u_r, v_r)$
as an error term on the underlying pure translational
displacement field $(u_t, v_t)$. Doing so, we note that this
“error” term has the following structure:

$u_r = \{Bf\} + \{-Cy\} + \left\{\frac{Bx^2 - Axy}{f}\right\}$,  (19)

$v_r = \{-Af\} + \{Cx\} + \left\{\frac{Bxy - Ay^2}{f}\right\}$.  (20)

That is, each component has a constant term, a term
which varies linearly with image coordinates, and a term
which is quadratic in image coordinates. This is
illustrated in Figure 3, where we see how the flow deviates
more and more from a constant as we approach the
periphery of the image, where the linear and quadratic terms
become important.

Let us see how the constant, linear, and quadratic
contributions compare to one another. We recall from
equation (2) that f is the camera focal length in pixels.
For instance, for a 256 x 256 image with 45° field of view,
f is about 309 pixels. The coordinates x and y, on the
other hand, are limited to being less than or equal to N/2
in magnitude, which in this case is 128. Thus,

$\frac{|x|}{f}, \frac{|y|}{f} \le \tan(FOV/2) = \frac{128}{309} \approx 0.4$.  (21)

For rotational angles which are comparable (A ≈ B ≈ C),
we see from (19,20) that the order of magnitude of the
linear and quadratic terms, with respect to the constant
term, is given roughly as

$\frac{\text{linear}}{\text{constant}} \lesssim 0.4, \qquad \frac{\text{quadratic}}{\text{constant}} \lesssim 2 \cdot (0.4)^2 \approx 0.32$.

Indeed, these upper limits are reached only at particular
regions near the image boundaries. For the greater part of
the image surrounding the center, the linear and quadratic
terms are much smaller than the constant term. This is
illustrated in Table 1, where we show the values taken by
the constant, linear, and quadratic term contributions to
u at various points in a 256 x 256 image, with a field of
view of 45°, for a rotation of A = B = C = 0.1°.

This leads us to make the assumption that the
dominant effect of rotations is to add a constant term to the
displacement field. We see that this will be a good
assumption when:

1. The field of view of the camera is small (say < 45°).

2. The Z component of the rotation is comparable to,
or smaller than, the X and Y components of the
rotation.

In particular, the assumption of constant displacement
error is not valid when the primary rotation is
around the line of sight (roll) (i.e., when C is much larger
than A or B), since the linear term will then be larger
than the constant term over a large portion of the image.
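
The relative sizes of the three rotational terms can be checked numerically. The sketch below evaluates the constant, linear, and quadratic parts of the x-component of the rotational displacement, taken here as Bf, -Cy, and (Bx^2 - Axy)/f, on a 5 x 5 grid of image positions for a 256 x 256 image with a 45° field of view and A = B = C = 0.1°. The grid positions (x and y at 0, ±64, ±128, with y decreasing down the rows) and the function names are our assumptions for illustration; they are not stated in the text.

```python
import math

def focal_length_pixels(n, fov_deg):
    """Focal length in pixels for an n x n image with the given field of view."""
    return (n / 2) / math.tan(math.radians(fov_deg) / 2)

def rotational_terms_u(x, y, A, B, C, f):
    """Constant, linear and quadratic parts of the rotational x-displacement."""
    constant = B * f
    linear = -C * y
    quadratic = (B * x * x - A * x * y) / f
    return constant, linear, quadratic

f = focal_length_pixels(256, 45.0)
rot = math.radians(0.1)  # A = B = C = 0.1 degrees

# Assumed sampling grid: y runs from +128 (first row) to -128 (last row).
for y in (128, 64, 0, -64, -128):
    cells = []
    for x in (-128, -64, 0, 64, 128):
        c, l, q = rotational_terms_u(x, y, rot, rot, rot, f)
        cells.append(f"({c:.2f}/{l:.2f}/{q:.2f})")
    print("  ".join(cells))
```

Under these assumptions the constant term stays near 0.54 pixel everywhere, while the linear and quadratic terms only approach 0.2 pixel at the image border.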

3.5 The Effect of Rotation on FOE Error

In this section we combine the results of the previous two
sections to find how rotations affect the position of the
FOE.

In Section 3.3 we showed that for an environment
coplanar with the image plane located at depth Z a
constant displacement error leads to a displacement field
which is still purely translational, so that an FOE exists.
The error in this FOE due to the displacement error is
given by equation (18). In Section 3.4 we argued that under
certain not terribly restrictive conditions, the dominant
effect of rotation was to add a constant error to the
displacements, given by the constant term in equations
(19,20):

$(a, b) = (Bf, -Af)$.  (22)


By combining (18) and (22), we find that the X and
Y rotations give an FOE error (in pixels) of

$(\delta x_0, \delta y_0) = \frac{Zf}{W}(B, -A)$.  (23)


To see how large an effect this is, let the FOV for a
256 x 256 image be 45°, and let B = A. In Table 2, we
show $|\delta x_0|$ for various values of Z/W and of B. The first
column gives the rotational angle, the second the
displacement error (in pixels) for each angle, and the remaining
columns give the FOE error (in pixels) for each angle.
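
The kind of calculation described here can be sketched in a few lines of Python. For a 256 x 256 image with a 45° field of view, the displacement error caused by a rotation B is taken as Bf, and the FOE error as (Z/W) times that. The particular angles and Z/W ratios looped over below are hypothetical examples, not the entries of Table 2.

```python
import math

def foe_error_from_rotation(angle_deg, z_over_w, n=256, fov_deg=45.0):
    """Return (displacement error, FOE error) in pixels for a rotation angle."""
    f = (n / 2) / math.tan(math.radians(fov_deg) / 2)
    displacement_error = math.radians(angle_deg) * f  # constant term B f
    return displacement_error, z_over_w * displacement_error

# Hypothetical sample of angles (degrees) and depth-to-translation ratios.
for angle in (0.1, 0.2, 0.5, 1.0):
    for ratio in (5, 10, 20):
        disp, err = foe_error_from_rotation(angle, ratio)
        print(f"angle={angle:4.1f} deg  Z/W={ratio:3d}  "
              f"displacement error={disp:5.2f} px  FOE error={err:7.2f} px")
```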


Table 2: Error in FOE (in pixels) with rotational
angle and Z/W
Of course, it is exceedingly rare that the environment
and the image are coplanar, so these results cannot be
strictly applied. However, we can draw a few general
conclusions from this analysis. The first conclusion is that
even for small amounts of rotation, the error in the FOE
can be large. This error will be larger when the
environmental points used to find the FOE are more distant.
It would therefore be advantageous to weight nearby
environmental points more than distant ones. We note that
the fundamental reason for this is that nearby points will
in general have larger interframe displacements than
distant points. Thus, to minimize the FOE error we should
give points with large displacements greater weight than
those with small displacements (e.g., Figure 4 shows the
larger flow vectors obtained from frame pair 9-11 of
Figure 5 extended to give an approximate FOE).


Figure 4: Estimated FOE obtained by extending the
larger flow vectors.
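
One possible way to realize such a weighting is a weighted least-squares intersection of the lines obtained by extending the flow vectors. The sketch below is only an illustration of the idea and is not the FOE algorithm used in the experiments: each flow vector defines a line through its image point, the FOE estimate minimizes the weighted squared perpendicular distances to these lines, and each line is scaled by its displacement magnitude (our choice of weight). The synthetic data are hypothetical.

```python
import numpy as np

def estimate_foe(points, flows, weighted=True):
    """Least-squares intersection of flow lines, optionally weighted by |flow|."""
    points = np.asarray(points, dtype=float)
    flows = np.asarray(flows, dtype=float)
    mag = np.linalg.norm(flows, axis=1)
    keep = mag > 1e-9                      # ignore (near-)zero displacements
    points, flows, mag = points[keep], flows[keep], mag[keep]
    # Unit normal of the line through each point along its flow vector.
    normals = np.stack([-flows[:, 1], flows[:, 0]], axis=1) / mag[:, None]
    w = mag if weighted else np.ones_like(mag)
    # Each row states: normal . foe = normal . point, scaled by the weight.
    lhs = normals * w[:, None]
    rhs = np.sum(normals * points, axis=1) * w
    foe, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return foe

# Synthetic purely translational field with hypothetical values.
rng = np.random.default_rng(0)
true_foe = np.array([27.0, 80.0])
pts = rng.uniform(-128, 128, size=(50, 2))
depth = rng.uniform(20.0, 80.0, size=50)
flows = (true_foe - pts) * (4.0 / depth)[:, None]   # magnitude D * W / Z

print(estimate_foe(pts, flows))
```

On error-free data both the weighted and unweighted variants return the true FOE; the weighting only matters once displacement errors are present.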

The second conclusion is that since most schemes
round off displacements to the nearest half-pixel, the first
row in Table 2 gives the worst-case error in finding the
FOE for such schemes. In general, such rounding-off
will give randomly distributed displacement errors, which
should not have much effect on the FOE. The worst-case
error assumes that the displacement errors are not
random, and hence do not cancel out.

4  Removing  Rotation  without 
Analyzing  General  Motion 

We have shown in the previous sections that if an
algorithm which assumes purely translational motion is
applied to an image sequence which has even a small amount
of interframe camera rotation, the “FOE” found by the
algorithm can be far from the FOE that would be found if no
rotation were present. Thus even small rotations can
seriously degrade the performance of a purely translational
algorithm.

One possible solution to this difficulty would be to
somehow remove the (small) rotational component of the
camera motion, producing a so-called registered image
sequence in which the interframe camera motion is purely
translational. This possibility has recently been
investigated by Pavlin and Snyder [14]. In the next sections we
discuss this registration scheme, its limitations, and some
experimental results from using it.

4.1  The  Registration  Algorithm 

Consider a dynamic image sequence $\{F_1, \ldots, F_n\}$ in which
the camera is allowed to have general interframe motion.
If we wish to run a purely translational motion algorithm
on this sequence we will, as just discussed, obtain
unreliable results in general because of the existence of small
rotational components to the interframe motion.
Suppose, however, that we could somehow find the rotational
components of the interframe motion. We could then
subtract out this motion from the second of each image
pair, obtaining thereby a new image sequence in which the
interframe camera motion is purely translational. This
new image sequence $\{F_1^{reg}, \ldots, F_n^{reg}\}$ will be called a
registered image sequence. The translational motion algorithm
could then be safely applied to this registered sequence.

Unfortunately, for reasons given in [14], the
construction of the registered image sequence is in principle
impossible, but an approximation to a registered image
sequence can be constructed as follows (see [14] for more
details, justifications, etc.). The key idea behind the
registration algorithm is that (infinitely) distant points move,
from frame to frame, only because of rotations (this is a
trivial consequence of the general optical flow equations).
Furthermore, if these points are not too far from the FOE,
the terms in $(u_r, v_r)$ which are quadratic in image
coordinates will be negligible compared to the constant and
linear terms, and hence may be ignored. We will call the
points satisfying these conditions the set of registration
points. As a consequence, the displacement field for such
registration points is extremely simple:

$u = u_r = \{Bf\} + \{-Cy\}$

$v = v_r = \{-Af\} + \{+Cx\}$.

We note that this displacement field is just a rotation by
an angle C around the origin, followed by a translation
(Bf, -Af) in the image plane, the combination of which
is just a general rigid transformation of the image plane.
If we now consider the set of all such registration points,
it is clear from the displacement field above that these
registration points move from one image to the next as a
rigid body.

The idea in [14] is to use an optical flow algorithm,
such as that of Anandan [8], to find point correspondences
between successive frames of the unregistered image
sequence. Picking a certain set of such points and
considering it as a (two-dimensional) rigid body, one can define
the center of mass of this set of points, and the principal
axes of its inertia tensor. The rotation angles A and B can
then be found by finding the displacement of the center
of mass of this rigid body between the frames in question,
and the rotation angle C can be found by finding the
rotation of the principal axes between the two frames. Thus,
the interframe rotational component of the camera’s
motion can be found.
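
The center-of-mass and principal-axes construction can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation of [14]: the point set is treated as moving by a rotation C about the image origin followed by a translation (Bf, -Af); C is taken as the change in orientation of the principal axis of the point scatter, and the translation is obtained from the center of mass after compensating for the rotation of the centroid about the origin (a small refinement we add). The helper names and test values are hypothetical, and the method is degenerate for an isotropic point set, whose principal axes are undefined.

```python
import numpy as np

def principal_axis_angle(pts):
    """Orientation (radians) of the major principal axis of a 2-D point set."""
    d = pts - pts.mean(axis=0)
    sxx, syy = np.sum(d[:, 0] ** 2), np.sum(d[:, 1] ** 2)
    sxy = np.sum(d[:, 0] * d[:, 1])
    return 0.5 * np.arctan2(2.0 * sxy, sxx - syy)

def estimate_rotation(p1, p2, f):
    """Estimate (A, B, C) from matched registration points in two frames."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    dtheta = principal_axis_angle(p2) - principal_axis_angle(p1)
    C = (dtheta + np.pi / 2) % np.pi - np.pi / 2   # axes are defined modulo pi
    R = np.array([[np.cos(C), -np.sin(C)], [np.sin(C), np.cos(C)]])
    t = p2.mean(axis=0) - R @ p1.mean(axis=0)      # translation (B f, -A f)
    return -t[1] / f, t[0] / f, C

# Synthetic check with hypothetical rotation angles.
rng = np.random.default_rng(1)
f = 309.0
A_true, B_true, C_true = np.radians([0.2, 0.1, 0.3])
p1 = rng.uniform(-60, 60, size=(40, 2)) * np.array([3.0, 1.0])  # elongated set
R_true = np.array([[np.cos(C_true), -np.sin(C_true)],
                   [np.sin(C_true), np.cos(C_true)]])
p2 = p1 @ R_true.T + np.array([B_true * f, -A_true * f])

A_est, B_est, C_est = estimate_rotation(p1, p2, f)
print(np.degrees([A_est, B_est, C_est]))
```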

Once these rotations have been found, then the
registered image can be obtained by subtracting off the
rotational component of the displacement at each pixel.
Unfortunately, this is a nonlinear transformation, and hence
will be computationally expensive. As a consequence,
the registered image is found by linearizing the rotational
component, that is by performing the inverse of the rigid
motion, i.e., translate the second image by (-Bf, +Af),
then rotate it by -C around the origin.

In practice, the registration algorithm of [14]
translates the second image by (-Bf, +Af), rounded off to
the nearest whole pixel, and does not perform the
rotation by -C around the origin. The former approximation
is made in the interests of computational efficiency, since
sub-pixel interpolation would be expensive. The latter
approximation is made since the quantization grid of a
rotated image does not coincide with the initial
quantization grid, so a complicated sort of interpolation would
be necessary.


4.2 Difficulties with the Registration Algorithm


There are a number of difficulties with the registration
algorithm described above, but the primary one is this:
How do you find the set of registration points, those
environmental points which are infinitely distant from the
image plane, yet whose image projections are close to the
FOE?

The answer, of course, is that it is difficult (if not
impossible) to know these things a priori. To be sure,
in some domains one might be able to specify a region
of the image in which distant points might be expected
to be found, but in general, one may not know this. The
second point we want to make is that genuinely “infinitely
distant” points are rare, so the question then becomes one
of whether distant points are distant “enough.”

The basic idea is that an environmental point is
distant “enough” if its image displacement due to translation
is much smaller than its image displacement due to
rotation. Only then can one say that the bulk of the
displacement is due to rotation. We can easily derive a condition
that expresses this. From the time-adjacency relation (1),
we see that the displacement d due to translation is just

$d = D\frac{W}{Z}$.  (24)

For image points near the center of the image, their
displacement is, as we have shown previously, of order Af (we
assume that B = C = 0 for simplicity). Combining these
two results, we see that the translational displacement
(d) will be much smaller than the rotational displacement
(Af) if

$D\frac{W}{Z} \ll Af$.

This therefore implies that a point is “distant” enough
only if Z/W is much larger than $Z_0/W$, where

$\frac{Z_0}{W} = \frac{D}{Af}$.

In Table 3 we list the values that the quantity $Z_0/W$ takes
for various values of rotation A and various distances D
from the “FOE” (in pixels). For example, with a rotation
of 0.2 degrees and a point at a distance of 25 pixels from
the “FOE”, $Z_0/W$ is 23. Therefore in our experiment with a
translation of four feet between frames (i.e., W = 4 feet) a
point must be farther than 92 feet for the approximation
to hold. For other rotations and distances from the FOE,
the values required can be considerably larger. We see
that except for quite large angles (of the order of several
degrees), and image points quite near the FOE, “distant”
points are unlikely to be found in typical images.
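
The threshold used in the worked example can be reproduced with a short Python sketch. It assumes the threshold is Z_0/W = D/(Af), which is what the example (rotation 0.2°, D = 25 pixels, Z_0/W of about 23) implies, and it assumes the 256 x 256, 45° field-of-view camera used earlier. The function name is hypothetical. Values computed this way can differ slightly from tabulated ones when the tabulated Af has been rounded first.

```python
import math

def min_depth_ratio(distance_px, rot_deg, n=256, fov_deg=45.0):
    """Z_0 / W = D / (A f): depth ratio above which a point counts as distant."""
    f = (n / 2) / math.tan(math.radians(fov_deg) / 2)
    return distance_px / (math.radians(rot_deg) * f)

ratio = min_depth_ratio(25, 0.2)
print(f"Z0/W = {ratio:.1f}")
print(f"minimum depth for W = 4 ft: about {4 * ratio:.0f} ft")

for rot_deg in (0.5, 1.0, 2.0, 5.0):
    row = [min_depth_ratio(d, rot_deg) for d in (25, 50, 75, 100)]
    print(rot_deg, [round(v, 1) for v in row])
```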


Table 3: Minimal distances Z_0/W for which points
can be considered “distant”. Only points for which
Z/W >> Z_0/W can be considered “distant” (see text for
an explanation)

Rotation A   Af          Distance D in pixels
(degrees)    (pixels)      25     50     75    100
   0.1         0.5         50    100    150    200
   0.2         1.1         23     45     68     91
   0.5         2.7        9.3     18     28     37
   1.0         5.4        4.6    9.3     14     19
   2.0        10.8        2.3    4.6    6.9    9.3
   5.0        27.0        0.9    1.9    2.8    3.7


Additional difficulties with the registration algorithm
are of an implementational nature. The approximation
of the (nonlinear) rotational displacement by its linear
terms will introduce errors, especially in the displacements
of peripheral points (which are precisely those with the
greatest displacement in general, and which, as we saw
in the previous section, are therefore those which should
receive the greatest weight in the finding of the FOE).
The rounding off of the rigid displacement (-Bf, +Af)
to the nearest pixel could introduce errors of up to 100%
in the removal of the rotational parameters (for instance,
if Bf = 0.51 pix, it would be rounded off to 1 pix). The
neglect of the “roll” term by not performing the rotation
about the origin by -C could introduce severe difficulties
into an FOE finding algorithm.
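
The size of the rounding error can be made concrete with a short sketch. The helper below (a hypothetical name) rounds a rigid shift to the nearest whole pixel and reports the leftover displacement relative to the true shift; the sample shifts are illustrative values.

```python
import math

def rounding_residual(shift_px):
    """Round a shift to the nearest pixel; return (applied, residual, relative error)."""
    applied = math.floor(shift_px + 0.5)       # nearest whole pixel
    residual = applied - shift_px
    relative = abs(residual) / abs(shift_px) if shift_px != 0 else float("inf")
    return applied, residual, relative

for shift in (0.51, 1.1, 2.7, 5.4):
    applied, residual, relative = rounding_residual(shift)
    print(f"shift={shift:4.2f} px  applied={applied} px  "
          f"residual={residual:+.2f} px  relative error={100 * relative:.0f}%")
```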

There are, as mentioned in [14], several ways of
eliminating some of the problems with the registration
algorithm. The most severe problem with the algorithm, the
difficulty of ensuring that the registration points are
distant points, could be minimized if, for instance, motion
were integrated with stereo, or with some sort of active
sensing. The latter modules could then be used to select
regions in the image which contain distant points.

Other  difficulties,  which  are  essentially  related  to  the 
finite  resolution  of  the  image  plane,  would  be  minimized 
by  working  with  high  resolution  images.  In  this  regard, 
it  should  be  noted  that  aside  from  the  correspondence 
component,  the  complexity  of  the  registration  algorithm 
is  independent  of  the  resolution. 

In order to see how the combination of the
registration algorithm and the FOE finding algorithm
performs, we ran the combined algorithm on several frames
from the image sequence described in the next section,
and illustrated in Figure 5. The position found for the
FOE from each pair of frames is shown in Table 4. The
“correct” FOE position (found from known camera
parameters and measured vehicle motion) should be
somewhere around $(x_0, y_0)$ = (27, 80), where the origin is in the
center of the image, and the distances are given in pixels.
We see that some of the values found for the FOEs are not
too bad, while others are completely erroneous (although
even the best in this experiment would not be sufficient
for depth extraction). We conclude that this combination
of algorithms does not generally give reliable FOEs.


Table 4: Recovered FOEs with and without
Registration. The image size is 256 by 256 with (0,0) at the
centre. The estimated FOE is around (27,80).

Frames   FOE             Registration       FOE
         (unregistered   correction         (registered
         frames)         in pixels (x,y)    frames)
1-3      (177,149)       (-0.1, 1.9)        (44,135)
3-5      (162,47)        (-0.8, 1.1)        (261,163)
5-7      (226,142)       (0.0, 1.6)         (36,111)
7-9      (244,109)       (0.0, 1.6)         (282,126)
9-11     (243,109)       (0.1, 1.1)         (51,76)


The solution to our problem would seem to be an
algorithm that finds both the rotational and translational
motion parameters (i.e., the FOE) at the same time. That
is, we should consider a general motion algorithm. In the
next section we consider such an algorithm constructed
from the work of Anandan and Adiv at UMass, and show
that promising results are obtained on the CMU NAVLAB
image sequence.

5 Depth from a General Motion
Algorithm

In this section we review a general motion algorithm to
estimate the motion parameters and depth as implemented
at the University of Massachusetts. We then show the
results of applying this algorithm to a real image sequence.
Finally, we compare this algorithm to the approximate
translational motion scheme discussed in the previous
section.


5.1 The General Motion Algorithm

The general motion algorithm implemented at the
University of Massachusetts to compute 3-D motion parameters
and depth is based on the work of Anandan [8,9,10] and
Adiv [12]. We discuss the two phases of the algorithm
below.


5.1.1 Displacement Field Determination

In the first phase Anandan’s [8] algorithm is used for
determining displacement fields. It operates on a pair of
images and uses a hierarchical correlation matching
approach. A multi-resolution, multiple spatial frequency
channel representation of the images is used for efficient
computation and accurate determination of displacement
fields. Confidence measures indicating the reliability of
each displacement vector are provided. A smoothness
constraint is then used to correct unreliable vectors based on
the reliable ones.

5.1.2  Depth  and  Motion  Parameter  Estimation 

In  the  second  phase  the  flow  field,  along  with  associated 
confidences,  found  via  Anandan’s  algorithm  is  taken  as 
input  to  Adiv’s  [11,12]  algorithm.  This  algorithm  consists 
of  two  steps,  which  we  now  describe. 

In the first step the flow field is segmented into
connected sets of flow vectors where each set is consistent
with the rigid motion of an approximately planar patch.
The segmentation is based on a modified version of the
generalized Hough transform, with displacement vectors
voting for the motion parameters. It is expected that
each segment would correspond to the motion of a
portion of only one rigid object. The segmentation approach
makes it possible to deal with independently moving
objects. When there are no independently moving objects,
this stage of segmenting the flow field is unnecessary and
the entire static environment can be viewed as a single
rigid object in the second phase of the algorithm.

In the second step, the segments found in the first step
are grouped together under the hypothesis that they have
been induced by a single rigidly moving object (i.e., the
planar surface assumption is dropped). This is done by
computing the optimal motion parameters and related
error measure for each segment by employing a least-squares
approach that minimizes the deviation between the
measured flow fields and that predicted from the estimated
motion and structure. For each hypothesized FOE the
optimal rotation parameters and the related error value
are computed. A multiresolution discrete sampling
technique is used to compute the minimum value of the
resulting error function. This step essentially involves grouping
segments of the flow field which are consistent with the
same motion parameters.
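
The idea of fitting rotation parameters for each hypothesized FOE can be illustrated with a much simplified sketch. It is not Adiv's algorithm: it uses a single-resolution grid of candidate FOEs and a different error measure. It relies on the fact that translational flow is directed along the line joining a point to the FOE, so the flow component perpendicular to that line is due to rotation alone and is linear in (A, B, C). For every candidate FOE the sketch solves a linear least-squares problem for (A, B, C) and keeps the candidate with the smallest residual. The rotational flow model (constant, linear, and quadratic terms in image coordinates), the function names, and the synthetic data are our assumptions.

```python
import numpy as np

def rotation_basis(x, y, f):
    """Rotational flow per unit (A, B, C) at image points (x, y)."""
    bu = np.stack([-x * y / f, f + x ** 2 / f, -y], axis=1)
    bv = np.stack([-(f + y ** 2 / f), x * y / f, x], axis=1)
    return bu, bv

def fit_rotation(pts, flow, foe, f):
    """Least-squares (A, B, C) and residual for one hypothesized FOE."""
    rx, ry = pts[:, 0] - foe[0], pts[:, 1] - foe[1]
    dist = np.hypot(rx, ry)
    ok = dist > 1e-9                          # skip a point sitting on the FOE
    nx, ny = -ry[ok] / dist[ok], rx[ok] / dist[ok]   # normal to radial direction
    bu, bv = rotation_basis(pts[ok, 0], pts[ok, 1], f)
    M = nx[:, None] * bu + ny[:, None] * bv
    rhs = nx * flow[ok, 0] + ny * flow[ok, 1]
    rot, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return rot, float(np.sum((M @ rot - rhs) ** 2))

def search_foe(pts, flow, candidates, f):
    """Return (foe, rotation, error) for the candidate with the smallest error."""
    best = None
    for foe in candidates:
        rot, err = fit_rotation(pts, flow, foe, f)
        if best is None or err < best[2]:
            best = (foe, rot, err)
    return best

# Synthetic general-motion flow field with hypothetical parameters.
rng = np.random.default_rng(0)
f = 309.0
pts = rng.uniform(-128, 128, size=(60, 2))
depth = rng.uniform(20.0, 80.0, size=60)
true_foe = np.array([27.0, 80.0])
true_rot = np.radians([0.2, 0.3, 0.1])
bu, bv = rotation_basis(pts[:, 0], pts[:, 1], f)
flow = (true_foe - pts) * (4.0 / depth)[:, None] \
    + np.stack([bu @ true_rot, bv @ true_rot], axis=1)

candidates = [(x0, y0) for x0 in range(7, 48, 10) for y0 in range(60, 101, 10)]
foe, rot, err = search_foe(pts, flow, candidates, f)
print(foe, np.degrees(rot), err)
```

A coarse-to-fine refinement of the candidate grid around the current best FOE would be the natural next step toward the multiresolution sampling mentioned above.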

Following the computation of 3-D motion parameters
the depth can easily be computed if the total translation
between frames is known.
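
A minimal sketch of this last step, under stated assumptions: once the rotational component has been subtracted, the remaining translational displacement of a point at distance D from the FOE has magnitude d = D W / Z, so that Z = W D / d. The function name and the synthetic values below are hypothetical, and points with a vanishing translational displacement (at the FOE or at infinite depth) must be excluded beforehand.

```python
import numpy as np

def depth_from_translation(pts, trans_flow, foe, W):
    """Depth Z = W * D / d from purely translational displacements."""
    D = np.linalg.norm(np.asarray(pts, float) - np.asarray(foe, float), axis=1)
    d = np.linalg.norm(np.asarray(trans_flow, float), axis=1)
    return W * D / d

# Synthetic example: W = 4 ft between frames, depths in feet.
rng = np.random.default_rng(2)
foe_xy = np.array([27.0, 80.0])
W = 4.0
image_pts = rng.uniform(-128, 128, size=(5, 2))
true_depth = np.array([76.0, 56.0, 46.0, 36.0, 21.0])
trans_flow = (foe_xy - image_pts) * (W / true_depth)[:, None]

print(depth_from_translation(image_pts, trans_flow, foe_xy, W))
```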

5.2  Experimental  Domain 

In an effort to determine ground truth for the experiments
a sequence of 20 images was taken using the NAVLAB
at Carnegie-Mellon University. The field of view of the

Table 5: Depth values of some points over a sequence of frames using the General Motion Algorithm. The
two tables used 100 and 200 points respectively. Depths are in feet. * and @ indicate respectively that the point was not
among the top 100 or top 200 Moravec points. ** indicates that the point is absent in the image pair.
[?] marks an entry that is illegible in the source.

                1-3    1-3    3-5    3-5    5-7    5-7    7-9    7-9   9-11   9-11
Object  pts.  Exptl   True  Exptl   True  Exptl   True  Exptl   True  Exptl   True
cone1      1   65.7     76   58.3     72   61.7     68   50.3     64   61.2     60
           2   66.9     76   67.7     72   60.5     68   63.6     64   59.6     60
cone2      3   61.4     76   67.2     72   65.1     68   63.0     64   63.9     60
           4   60.8     76   82.3     72   56.2     68   61.7     64   61.7     60
cone3      5   50.2     56   59.2     52   46.3     48   40.8     44   38.4     40
           6   51.1     56   49.6     52   46.1     48   41.0     44   38.5     40
cone4      7   50.3     56   53.8     52   44.4     48   35.9     44   37.9     40
           8   46.3     56   53.3     52   47.6     48   41.8     44   39.8     40
can        9   44.1     46   44.4     42   47.6     38   41.8     34   39.8     30
          10      *     46      *     42      *     38      *     34      *     30
          11      *     46      *     42      *     38      *     34      *     30
cone5     12   31.0     36   32.2     32   26.0     28   22.0     24   20.0     20
          13   31.1     36   31.1     32   26.3     28   22.5     24   20.8     20
          14   31.9     36   30.9     32   28.5     28   21.9     24   20.5     20
cone6     15   18.1     21      *     17     **     **     **     **     **     **
          16   18.4     21      *     17     **     **     **     **     **     **
          17   18.9     21    -29     17     **     **     **     **     **     **
          18   18.0     21    -42     17     **     **     **     **     **     **

                1-3    1-3    3-5    3-5    5-7    5-7    7-9    7-9   9-11   9-11
Object  pts.  Exptl   True  Exptl   True  Exptl   True  Exptl   True  Exptl   True
can        9    [?]     46    [?]     42   36.5     38    [?]     34   29.0     30
          10   40.4     46      @     42    [?]     38   30.3     34   32.4     30
          11   44.2     46   42.5     42    [?]     38   32.7     34      @     30


camera was measured, as was the inclination of the camera
axis from the road plane. Traffic cones were placed at
measured locations in the environment, and the actual
interframe translation of the vehicle was measured and
marked on the roadway, so that ground truth depth data
could be reasonably precisely obtained. We obtained a
sequence of about 20 usable images, some of which are
shown in Figure 5.

5.3 Experimental Results

Experiments to determine depth and motion parameters
were conducted using the general motion algorithm
described above. To speed up computation of Anandan’s
algorithm, “interesting” or distinctive points were found
via the Moravec operator in the images, and flow fields were
determined only at these locations. Two sets of experiments
were performed, one using 100 points and the other using
200 points. The flow fields are shown in Figure 5.
False matches are sometimes produced, particularly near
the boundary of the image, due to feature points moving
out of the image. This is particularly noticeable in the
frame pairs (3-5) and (5-7).

Adiv’s algorithm was run on the flow fields produced
by Anandan’s algorithm. As mentioned earlier, the
segmentation part of Adiv’s algorithm was unnecessary
because there is in effect only one moving object. The
motion parameters produced by Adiv’s algorithm are shown
in Table 6.

The translation results are all the same except for
frame pair 3-5 run with 200 points. This suggests
that Adiv’s algorithm finds the translation vector fairly
well. The less accurate results for frame pair 3-5 run with
200 points are caused by false matches at the boundary
of the image, aggravated by choosing points which are not
distinctive.

The results also indicate an approximately constant
rotation of about 0.1 to 0.5 degrees about the Y-axis, a small
and varying component about the X-axis, and a random
rotation about the Z-axis. This is consistent with:


Table  6:  Motion  Parameters  obtained  through  the 
General  Motion  Algorithm.  The  frame  pairs  are  at 
4  ft.  intervals.  The  results  have  been  tabulated  for  100 
Moravec  points  and  200  Moravec  points.  (U,V,W)  is  the 
unit  translational  vector  and  (A,B,C)  are  the  rotational 
components  in  degrees 


1.  (x-axis):  a  road  surface  which  deviates  very  slightly 
from  being  planar  either  because  of  bumps  or  the  surface 
itself  being  non-planar. 

2.  (y-axis):  a  small  drift  of  the  vehicle  to  the  right. 

3.  (z-axis):  some  random  motion  probably  due  to  the 
vehicle  roll. 

However, we do not have any measurements of the
rotational components to verify how correct the derived
rotational parameters are.

The depth results (for the set with 100 points) are
shown in Table 5 for the obstacles on the road (the points
are labelled in Figure 6). Since several points are missing
in the case of the can, for the can we also show depth
values when 200 interest points were used. In general,
these are fairly good when compared to other attempts at
the recovery of the depth of points (average error roughly
15%). The results improve for some of the later frame
pairs. Occasionally absurd depth values are returned
because of false matches; this is particularly noticeable for
cone 0 when it is very close to the edge of the image. For
points which are at a distance, the depths returned are
usually reasonable, with some absurd values returned
because occlusion causes false matches. The depth values
obtained by using 200 interest points show small
variations from the ones in Table 5. Since in most cases the
translation parameters found and the displacements of the
points are the same, this suggests that depth computation
is very sensitive to the accuracy with which the rotational
parameters are computed.

It should be noted that, in view of the results reported
by earlier researchers regarding structure computations,
this appears to be a very promising result. It also
appears as if this technique of solving the motion problem is
superior to the methodologies which rely on approximate
translational algorithms. It should also be noted that the
displacement vectors found by Anandan’s algorithm are
indeed remarkable, as can be seen from the displacement
vectors shown in the images at the end of this paper.

6  Conclusion 

It is our contention that the depth of points is difficult
to determine from approximately known translational motion
in the presence of even small rotations. If the rotational
effects are not removed, large errors in the location of the
FOE can occur, and the effect upon depth recovery can be
disastrous. One scheme to eliminate small rotations by using
distant points, which would have primarily rotational
displacements, was not effective. While each of the
algorithms may work well in isolation, a global system is
much more difficult to construct. The cumulative errors from
the subsystems tend to propagate and give disastrous results
on many occasions. The algorithms developed by Anandan
for computing displacements, and by Adiv for computing
the parameters of general sensor motion and depth,
were applied to the same image sequence. The results
appear to be quite good, although ground truth for sensor
motion was not available. The extracted translational
motion across five pairs of frames (20 feet) in the
environment was constant, while the rotation was computed to
be small (we note that no ground truth was available for
the rotational components). The patterns of these rotations
seem quite reasonable for the context of the vehicle
moving on a planar path. The extraction of depth for
many of the objects was reasonably good, and sometimes
surprisingly accurate at distances of 40 feet or greater.

The use of a gyroscopically stabilized platform would
allow the simpler translational algorithms to be applied
in extracting the depth of points. Translational motion
processing should be computationally more efficient and
effective. As an alternative to the algorithms described in
this paper, a variety of other methods have been proposed
to determine depth and motion parameters from image
sequences. Accurate depth determination through methods
based only on stereo is probably difficult to apply
from a moving vehicle because of the rather small baseline.
However, a combination of stereo and motion has
been proposed [23,24,25], including recent research from
our group presented in this volume [26], in a form
that might provide additional constraints to be effective.
Alternatively, the detection of features like lines might be
more effective in depth computation. Current research
(also presented in this volume) describes token-based
techniques for computing correspondence of linear structures

Figure  6:  The  interest  points  marked  on  frame  1  of  the  image  sequence 


over  time  in  order  to  extract  depth  from  looming  [27,28]. 
Initial  experimental  results  have  shown  cases  where  depth 
extraction  has  been  quite  accurate. 

In summary, there are currently no practical motion
systems providing robust and accurate determination of
depth in real scenes. However, there are promising
directions actively being explored. Nevertheless, it will require
considerable effort to develop a working system which will
perform robustly in view of many of the problems that
occur in real world situations.
Abstract

The recent work on perception and measurement of visual
motion has consistently advocated the use of hierarchical
representation and analysis. In most of the practical applications of
motion perception, such as robotics, object tracking in space
applications, etc., it is absolutely necessary to be able to construct
and process these hierarchical image representations in real time,
and this calls for an inherently parallel implementation. In this
paper, we first briefly review the past work on motion
perception that makes use of hierarchical models and identify the need
for their real time implementation. We then describe a new
hierarchical model for extraction of one form of motion
information from an image sequence, namely optic-flow, based on the
recently proposed spatiotemporal frequency approach. We then
identify the inherent parallelism in the model and show how to
exploit it to implement the model in real-time on the PIPE
image processing architecture. In the implementation, we
construct the underlying image pyramids and compute the optic
flow in real time. Finally we present the results of the PIPE
implementation. They are also available on a video-tape to display
the real-time nature of the computations.

1  Introduction 

Many basic image processing and scene analysis operations can
be performed more efficiently and robustly using multiresolution
representations. Because of their hierarchical organization
and local interconnection patterns, they form the basis of very
simple and efficient algorithms for low-level image processing
operations and thus lend substantial robustness to the
higher-level algorithms via coarse-to-fine strategies. Recent studies
in psychology reveal that the pyramid is a good model for early
stages of human vision. Several such multi-resolution
representations have been described in the literature [Bur84], [Mal87],
[Cro84], [Shn84]. Our interest in pyramid representations has
been because of their promise in the area of detection and
measurement of visual motion in the form of optic flow [Wax84].
Optic flow depicts the 2-D projections of the 3-D scene
velocities on the projection surface. Recent developments in optic
flow computation have begun to identify the potential of
hierarchical representations and processing. Glazer [Gla87]
extensively investigated the idea of using hierarchical schemes for
motion measurement. Recent work by Anandan [Ana86c] uses
Burt’s [Bur84] Laplacian pyramid for hierarchical correlation
based matching to compute the optic-flow over two successive
image frames. Heeger [Hee87b], in his recent work on optic flow
using spatiotemporal filtering, alludes to the possibility of using
multi-resolution schemes for more robust results.


The techniques described in the literature for determination
of optic-flow [Sin87c] lie in the following three basic
categories: intensity gradient and difference based methods, such as
the classic works by Horn and Schunck [HS81], Thompson and
Barnard [TB87], etc.; intensity correlation based methods, such
as the recently proposed model by Anandan [Ana86a]; and
spatiotemporal frequency based methods that formulate the problem
of optic-flow computation as that of detecting oriented edges in
space-time, such as recent works by Adelson and Bergen [AB85]
and Heeger [Hee87b].

The objective of this paper is threefold. Firstly, we show,
with examples from past research, that hierarchical schemes hold
promise in at least the last two of the abovementioned
three categories of methods for optic-flow computation.
Secondly, we propose a new hierarchical computational model
that lies in the third category described above. It uses a simple
pyramid scheme, which we call the pyramid of oriented edges,
or a POE, to decompose the incoming image sequence into
oriented spatiotemporal frequency channels that can be used to
determine optic flow. The POE is described as a logical
extension of Burt’s pyramid. We also compare our POE to the
functionally (but not computationally) similar Mallat’s
pyramid [Mal87], which has shown promise in optic flow computation
[Hee87a]. Thirdly, we identify that for these approaches to be
practically applicable, it is of utmost importance to be able to
construct the underlying image structures, namely image
pyramids, and compute optic flow in real time. The practical issues
of real time computation of optic-flow have not been investigated
in past research. We show that by exploiting the inherent
spatial and temporal parallelism in the model, we can come up
with pixel parallel algorithms that can be conveniently
implemented on state of the art image processing architectures
such as the PIPE [The87]. We show results of real-time
implementation of the POE, as well as the optic-flow derived from it,
on the PIPE system. For the purpose of comparison, we also
show an implementation of Mallat’s pyramid discussed above.

The organization of this paper is as follows. The three
objectives described above are covered in Sections 2, 3 and 4
respectively. Section 5 summarizes this work and enumerates the
areas in which there is scope for further work along these lines.
It particularly emphasizes the need for, and a possible approach
to, the integration of the two optic-flow models mentioned above.


2 Optic Flow Determination - A Case for
Hierarchical Models

In the previous section we alluded to the fact that hierarchical
schemes have a potential application to at least the last
two of the three categories, namely intensity correlation based
methods and spatiotemporal frequency based methods. In this
section we elaborate on this observation. The purpose is not
merely to provide a review, but to illustrate two
different approaches whose computational models can be built
around a common image structure, namely, some image pyramid.
This common image structure can be used as the tool to
integrate the two approaches to get a more robust model. We
discuss the issue of integration as a part of our ongoing work in
the concluding remarks at the end of the paper.

Intensity-correlation based techniques essentially use two
successive images from an image sequence. They use windows of
defined size around all (or some selected) pixels in the first image
to look for a pixel in the second image which has a similar
intensity structure in the window around it. Loosely speaking, these
techniques search for “matching” pixels. The matching process
suffers from a combinatorial explosion. The effectiveness of
correlation based techniques can be improved by using a hierarchical
search strategy to avert the combinatorial explosion. This
technique was successfully used by Anandan [Ana86b]. He used
Burt’s Laplacian pyramid in the matching process.

Spatiotemporal frequency based methods have recently shown
significant promise in the problem of motion perception [AB85], [AA85].
Heeger [Hee87b], in his recent work based on the spatiotemporal
frequency approach, made use of Gabor filters for determination
of optic-flow. He suggested that a multiresolution
scheme such as Burt’s pyramid can be used to construct families
of Gabor filters that will give a more robust estimate of optic-flow
than the current version of his model, which uses only
one such family of Gabor filters. He has also indicated [Hee87a]
the usefulness of another pyramid representation, namely
Mallat’s pyramid [Mal87], in implementing his model. A very brief
functional description of Mallat’s pyramid and its applicability
to optic-flow determination will be given elsewhere in this paper.
Rather than reviewing Heeger’s model in order to show
the promise of hierarchical schemes for spatiotemporal frequency
based methods, we defer this to the next section, where,
after a systematic description of motion in the spatiotemporal
frequency domain, we describe a new hierarchical computational
model that can be used to compute optic-flow based
on spatiotemporal frequency channels.

3 Pyramid of Oriented Edges: A Hierarchical
Model for Spatiotemporal Frequency
Based Optic Flow Computation

3.1 Motion In Frequency Domain

An image is essentially a two dimensional signal. Likewise, a
sequence of images is a three dimensional signal. It is easy to
see [WA83], [AA85] that the spatial and temporal frequencies
are related as:

$\omega_t = \omega_x U_x + \omega_y U_y$    (Eqn. 1)

Here, $\omega_t$ is the temporal frequency and $\omega_x$ and $\omega_y$ are the spatial
frequencies, in the x and y directions respectively, in the
neighborhood of a point in the image. Also, $U_x$ and $U_y$ are the image
velocity (optic flow) components in the x and y directions. Eqn.
1 suggests a technique for recovering the image velocity. The
mechanism would have to “detect” spatial and temporal
frequencies in the sequence of images, making use of small spatial
and temporal neighborhoods, and use them to compute the
image velocities. In order to further fix this idea we show how this
follows directly from the conventional space-time formulation of
motion.
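
As a small numerical illustration of Eqn. 1, the sketch below builds the phase of a single translating sinusoidal grating and checks that the temporal frequency observed at a fixed pixel equals the value Eqn. 1 predicts. The function names, the sample point and the time step are hypothetical choices made for this example, and the sign of the measured frequency depends on the Fourier convention assumed (here, phase lag per unit time).

```python
def temporal_frequency(omega_x, omega_y, u_x, u_y):
    """Temporal frequency predicted by Eqn. 1 for spatial frequencies
    (omega_x, omega_y) translating with image velocity (u_x, u_y)."""
    return omega_x * u_x + omega_y * u_y


def grating_phase(x, y, t, omega_x, omega_y, u_x, u_y):
    """Phase of the translating grating cos(omega_x*(x - u_x*t) + omega_y*(y - u_y*t))."""
    return omega_x * (x - u_x * t) + omega_y * (y - u_y * t)


def measured_temporal_frequency(omega_x, omega_y, u_x, u_y, x=3.0, y=-1.0, dt=0.5):
    """Phase lag per unit time seen at the fixed pixel (x, y).
    Assumption: the sign convention is chosen so that the result matches Eqn. 1."""
    before = grating_phase(x, y, 0.0, omega_x, omega_y, u_x, u_y)
    after = grating_phase(x, y, dt, omega_x, omega_y, u_x, u_y)
    return (before - after) / dt


if __name__ == "__main__":
    predicted = temporal_frequency(0.5, 0.25, 2.0, -1.0)
    measured = measured_temporal_frequency(0.5, 0.25, 2.0, -1.0)
    print(predicted, measured)
```

Because the phase is linear in time, the measured value does not depend on the pixel or the time step chosen.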

3.2 Space-Time Domain v/s Spatiotemporal Frequency
Domain: What’s the Connection?

A very concise, but perhaps not very comprehensive, way to
answer this question is to use Fourier transform based analysis. We
will, on the other hand, take a very qualitative approach to
investigate this issue. For this purpose, we will make extensive
use of the example of a vertical bar moving in a horizontal
direction used by Adelson and Bergen [AB85]. This example consists
of a vertical bar moving to the right as shown in Figure 1a (after
Adelson). Figure 1b shows a three dimensional spatiotemporal
diagram of the moving bar, which shows up as a slanted slab in
space-time. Figure 1c shows a cross-section parallel to the x-t
plane (the y dimension, containing no useful information, can
be dropped). Figure 2a (also after Adelson) shows the same x-t
slice for various values of the velocity. It can be seen that the
velocity can be measured by measuring the slope of the edge in the
x-t plane: the higher the velocity, the larger the slope. Thus, motion
detection involves detecting an edge in space-time and motion
measurement involves measuring the slope of this edge. Figure
2b shows a simplistic (and perhaps non-realizable) approach to
detecting a spatiotemporally oriented edge and (simultaneously)
determining its slope. It essentially requires a spatiotemporally
oriented filter, with positive and negative weighting factors as
shown in the figure (footnote 1).
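
The idea that velocity is the slope of an edge in the x-t plane can be sketched numerically. The example below is only an illustration under stated assumptions, not the oriented-filter mechanism of Figure 2b: it synthesizes a binary x-t slice of a bar moving at an integer number of pixels per frame, tracks the bar's left edge in each frame, and fits a least-squares line to recover the slope dx/dt. All function and parameter names are hypothetical.

```python
def xt_slice(width, frames, start, bar_width, velocity):
    """Binary x-t slice: row t holds the bar occupying [left, left + bar_width)."""
    rows = []
    for t in range(frames):
        left = start + velocity * t
        rows.append([1 if left <= x < left + bar_width else 0 for x in range(width)])
    return rows


def edge_velocity(rows):
    """Least-squares slope dx/dt of the bar's left edge across the frames."""
    ts, xs = [], []
    for t, row in enumerate(rows):
        if 1 in row:                      # skip frames where the bar has left the image
            ts.append(t)
            xs.append(row.index(1))
    if len(ts) < 2:
        raise ValueError("need the edge in at least two frames")
    mean_t = sum(ts) / len(ts)
    mean_x = sum(xs) / len(xs)
    num = sum((t - mean_t) * (x - mean_x) for t, x in zip(ts, xs))
    den = sum((t - mean_t) ** 2 for t in ts)
    return num / den


if __name__ == "__main__":
    slow = xt_slice(width=40, frames=8, start=3, bar_width=4, velocity=1)
    fast = xt_slice(width=40, frames=8, start=3, bar_width=4, velocity=2)
    print(edge_velocity(slow), edge_velocity(fast))
```

With x plotted against t, the faster bar produces the steeper edge, which is the relationship the text describes.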

With the foregoing discussion on motion perception as edge
detection in space-time, we have set the stage for looking at
the connection between motion as viewed in space-time and as
viewed in the spatiotemporal frequency domain. As depicted in
Figure 2b, what is needed to locate an oriented edge in space-time
is a spatiotemporal filter with its impulse response oriented in
space-time. Every filter has a distinct frequency response and
so does this hypothetical filter. A typical frequency response of
such a filter is shown in Figure 3. It is evident that the
filter is tuned to a band of spatial and temporal frequencies.
This can now be seen in light of a 2-D counterpart (footnote 2) of Eqn. 1.
If we have an infinity of filters, each tuned to a distinct
combination of $\omega_x$ and $\omega_t$, only one of these filters will give a “significant”
response when stimulated with a vertical sine-wave grating
moving in a horizontal direction. Knowing $\omega_x$ and $\omega_t$ of this filter,
the velocity can be trivially computed.

Footnote 1: It is noteworthy that in an ideal case, we will need an
infinity of these filters, each having a different orientation, because
the moving bar can ideally have any velocity, resulting in an edge in
the x-t plane that can have any arbitrary slope.

3.3  Back  to  Equation  1:  Ideal  versus  Practical 
Approach 

In  this  section  we  discuss  the  quantitative  approach  to  using 
Eqn.  1  for  computation  of  the  optic  flow  for  a  2-D  translating 
pattern.  We  will  first  show  the  most  ideal  setting  to  extract 
the  optic  flow  and  then  develop  a  practical  model.  In  both  the 
cases,  for  the  purpose  of  simplicity,  we  begin  the  analysis  with 
a  case  where  the  image  motion  is  in  only  one  spatial  direction 
(x-direction,  as  in  Eqn.  2)  and  then  extend  the  model  to  general 
x-y  motion.  The  assumptions  on  which  the  model  is  based  are: 

• The sequence of images for which the optic-flow is to be
determined contains sufficient texture. This assumption
is particularly valid in real life scenes, and it
guarantees that the aperture problem does not exist.
It will be shown later how to extend the model to work
in scenes where not enough texture is available and the
aperture problem is inherently present.

• All motion is locally translational. That is, in a small
spatial neighborhood in the image, the image motion can
be assumed to be pure translation.

•  All  the  analytical  treatment  to  follow  applies  to  each  point 
on  the  image.  Thus,  the  model  that  will  be  developed  will 
consider  a  local  neighborhood  of  each  point  (pixel)  on  the 
image  to  compute  the  image  velocity  at  the  point  under 
consideration.  This,  iterated  (or  done  in  parallel)  over  all 
points  on  the  image  gives  a  dense  optic-flow  array. 

The Ideal Case: Consider a pattern translating in the
x-direction. The spatial and temporal frequencies must satisfy
the relationship shown in Eqn. 2. This shows that the velocity
of translation is equal to the slope of the line passing through
the origin and the point ($\omega_x$, $\omega_t$) in the spatiotemporal frequency
plane. In the Fourier jargon, it is customary to characterize this
by saying that the spatiotemporal energy of the translating
pattern is concentrated along a line of slope $U_x$ in the ($\omega_x$, $\omega_t$)
plane. The intent, therefore, is to find the slope of the line in
the ($\omega_x$, $\omega_t$) plane along which all the motion energy is
concentrated. A possible, but impractical, approach to doing this is
shown in Figure 4a. The mechanism consists of an infinity of
band-pass (or rather point-pass) filters arranged along a circle,
each filter tuned to a unique combination of $\omega_x$ and $\omega_t$. Thus if
the incoming sequence of images is convolved with each one of
these filters, only one of them will have a non-zero energy in it,
the one that lies on the line with a slope of $U_x$. This slope is
immediately known.

This discussion can easily be extended to 2-D motion.
In this case, Eqn. 1 defines the relationship between the spatial
and temporal frequencies of the translating pattern. It is evident
again that the orthogonal components $U_x$ and $U_y$ of the velocity
of translation are the direction cosines of a plane in the ($\omega_x$, $\omega_y$,
$\omega_t$) space that satisfies Eqn. 1. In the Fourier jargon again, the
spatiotemporal energy of the translating pattern is concentrated
along a plane in the ($\omega_x$, $\omega_y$, $\omega_t$) space, with direction cosines
of its normal as $U_x$ and $U_y$. The intent therefore is to locate at
least two points on the plane so that the plane is uniquely known
(actually three points on a plane are required to determine the
plane uniquely, but in our case, the third point is the origin of
the ($\omega_x$, $\omega_y$, $\omega_t$) space, and is known). Again, Figure 4b shows a
possible, but impractical, way to identify the energy plane. This
consists of an infinity of frequency selective filters, each tuned to
a unique combination of $\omega_x$, $\omega_y$ and $\omega_t$, placed all over an infinite
cylinder centered around the origin and coaxial with the $\omega_t$ axis.
If the incoming sequence of images is convolved with each one
of these filters, only two of them will have a non-zero motion
energy content. Since they lie on the motion plane, the direction
cosines of the normal to the plane are immediately known and
thus the velocity of translation at the point under
consideration is known.

Footnote 2: This relates $\omega_x$ and $\omega_t$. The 3-D formulation, when one
temporal and two spatial dimensions are considered, is slightly
different and we will discuss it later.
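
The two-point argument for the ideal 2-D case can be made concrete with a short sketch. Assuming two frequency triples ($\omega_x$, $\omega_y$, $\omega_t$) that both satisfy Eqn. 1, the velocity components follow from a 2x2 linear system, solved here by Cramer's rule. The function name and the degeneracy threshold are hypothetical; real filter outputs would be noisy, which this idealized example ignores.

```python
def velocity_from_two_points(p1, p2, eps=1e-12):
    """Solve omega_t = omega_x*U_x + omega_y*U_y for (U_x, U_y), given two
    points (omega_x, omega_y, omega_t) lying on the motion plane."""
    wx1, wy1, wt1 = p1
    wx2, wy2, wt2 = p2
    det = wx1 * wy2 - wx2 * wy1
    if abs(det) < eps:
        # Spatial frequencies are parallel: only one velocity component is
        # constrained, so the plane cannot be recovered from these two points.
        raise ValueError("the two points do not determine the plane")
    u_x = (wt1 * wy2 - wt2 * wy1) / det
    u_y = (wx1 * wt2 - wx2 * wt1) / det
    return u_x, u_y


if __name__ == "__main__":
    # Two points generated from Eqn. 1 with (U_x, U_y) = (2, -1).
    print(velocity_from_two_points((0.5, 0.25, 0.75), (1.0, 1.0, 1.0)))
```

The degenerate case, where both points share the same spatial orientation, corresponds to having information about only one direction of motion.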

The  Practical  Case:  The  foregoing  approach  is  obviously 
impractical  for  two  reasons. 

• It is impossible to design filters tuned to a “point” in the
3-frequency space. As we will see in the next section, it
is, however, possible to design filters selective to a narrow
band of frequencies. They have maximum response at
the center frequency of the filter, and the response falls off
in a known fashion, depending on the specifications of the
filter, as the frequency deviates from the center frequency.
A typical example of the frequency response of such a filter
is shown in Figure 3.

• It is not possible to have an infinite number of frequency
selective filters. A practical approach is to have a small
number of filters sensitive to the oriented bands of
spatiotemporal frequencies and, based on the relative energy
contained in each of these bands, find an approximate slope
of the motion line (or the direction cosines of the normal
to the motion plane). This is explained below for a 1-D
motion case.

The practical approach suggested in the foregoing discussion
is realized in Figure 5a. A similar, but not identical,
approach was suggested by Heeger [Hee87b]. It consists of
five filters placed along a line parallel to the $\omega_t$ axis. The power
spectrum of each of these filters is a Gaussian centered at the
center frequency of the filter in the ($\omega_x$, $\omega_t$) plane. These
Gaussians are represented by concentric circles of diminishing radial
density as one moves farther away from the center frequency.
Evidently each one of these filters will give a non-zero energy
when convolved with the incoming sequence of images. The
filter whose center frequency is closest to the motion line will
give maximum response and the one farthest will give minimum
response.

The energy outputs can be plotted against the temporal
frequency (the spatial frequency being constant) as shown in Figure
5b. If a smooth curve is drawn to interpolate (or approximate,
under some minimum error criterion) the five known points, the
maximum of the curve will correspond to a filter whose center
frequency lies exactly on the motion line. Thus the velocity
of translation is known at the point under consideration. Having
investigated the principle underlying the extraction of optic flow
from a finite number of spatiotemporally oriented filters, we have
set the stage for describing the pyramid of oriented edges, which
provides a simple scheme for getting a family of spatiotemporally
oriented filters, the basic hardware for optic flow extraction.
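
The interpolation step described above can be sketched as follows. The text does not specify which smooth curve is fitted; this example assumes a parabola fitted to the logarithm of the energies at the strongest filter and its two neighbors, which is exact when the energies fall off as a Gaussian of the distance to the motion line. The function names and the synthetic energies are hypothetical, and the filters are assumed to be equally spaced in temporal frequency at a common spatial frequency.

```python
import math


def peak_temporal_frequency(centers, energies):
    """Estimate the temporal frequency of maximum energy from equally spaced
    filter centers, using a parabolic fit to the log-energies (an assumption)."""
    n = len(centers)
    if n < 3 or n != len(energies):
        raise ValueError("need at least three matching samples")
    k = max(range(n), key=lambda i: energies[i])
    k = min(max(k, 1), n - 2)             # keep a neighbor on each side
    left, mid, right = (math.log(energies[i]) for i in (k - 1, k, k + 1))
    curvature = left - 2.0 * mid + right
    if curvature >= 0.0:                  # no proper maximum: fall back to the sample
        return centers[k]
    step = centers[k + 1] - centers[k]
    return centers[k] + 0.5 * (left - right) / curvature * step


def velocity_1d(omega_x, centers, energies):
    """Slope of the motion line through the origin: U_x = omega_t_peak / omega_x."""
    return peak_temporal_frequency(centers, energies) / omega_x


if __name__ == "__main__":
    centers = [-2.0, -1.0, 0.0, 1.0, 2.0]   # five filters, as in Figure 5a
    true_velocity, omega_x, sigma = 0.3, 1.0, 0.8
    energies = [math.exp(-(c - omega_x * true_velocity) ** 2 / (2 * sigma ** 2))
                for c in centers]
    print(velocity_1d(omega_x, centers, energies))
```

All energies must be strictly positive for the logarithm to be defined, which matches the observation that every filter gives a non-zero response.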

3.4  The  Pyramid  of  Oriented  Edges:  A  Family  of 
Spatiotemporally  Oriented  Filters 

It was clearly brought out in the previous paragraphs that filters
with spatiotemporally oriented impulse response are sensitive to
motion information in an image sequence. Henceforth, in the
ensuing discussion, these spatiotemporally oriented filters will
also be referred to as motion sensitive filters, for obvious
reasons. What is needed to detect and measure motion is a set of
motion sensitive filters strategically located in the 3-frequency
space and a set of procedures to extract optic-flow from the
relative magnitude of the energy content of these filters.
Over the past few years, researchers in physiology as well as
computational vision have proposed a variety of motion sensitive
filters. With very few exceptions, most of the reported research
[WA83], [FJ85], [AA85], [AB85], [MU81] describes only how to
construct motion sensitive filters, without a detailed treatment
of how to extract optic-flow, given the motion sensitive filters.
We describe below a new image structure, which we call a Pyramid
of Oriented Edges or a POE, that provides a family of filters,
each member of which occupies a band of spatiotemporal
frequencies that is oriented in the spatiotemporal-frequency space.
Seen in the space-time domain, each member of this family of
filters is sensitive to an edge with a specific orientation in
space-time and hence is motion-sensitive.

To keep the discussion simple, we will first describe a 2-D
version of the POE. This version is essentially a spatial
pyramid with no temporal dimension. This version of the model
works on a single image and decomposes it into several
“channels”, each channel sensitive to spatial edge features in a specific
direction. Functionally, therefore, each channel is an orientation
selective filter in space only. We will discuss the issues involved
in extending the 2-D POE to incorporate the temporal
dimension at the end. The 3-D POE thus obtained will have
channels selective to oriented edges in space-time and hence
will be motion sensitive.

We explain the computational approach taken in the model
using a specific instance that has three channels at each level
of the pyramid, sensitive to edge features in the horizontal,
vertical and diagonal directions, as shown in Figure 6. The
computational scheme to construct the POE is a logical extension
of Burt’s pyramid representations. A brief qualitative review of
Burt’s representation and its proposed extension is given below.
Burt’s representation consists of two pyramids, namely the
Gaussian pyramid and the Laplacian pyramid. The Gaussian
pyramid contains low-pass filtered copies of the original image
at successively decreasing resolutions. The Laplacian pyramid, on
the other hand, contains band-pass filtered copies of the original
image. A representative example is shown in Figure 7a (after
Burt).


The computations involved in constructing, say, L_i and G_{i+1},
given G_i (the original image being G_0), can be visualized easily
from Figure 7b. In this figure, H and V refer to the Gaussian
filters in the horizontal and vertical directions, as defined by
Burt. The characteristic property of these filters that is relevant
here is that the spatial frequency content of the filtered image is
half that of the input image. Thus, for example, if an image is
convolved with H, the frequency content of the filtered image in the
horizontal direction is reduced to half its original value. This is
depicted in Figure 7b(i). Similarly, Figure 7b(ii) shows an image
(G_i) filtered with H and V successively. Since the overall
frequency content of the resultant image is half the original value,
it is justified to decimate the resultant image by sampling every
other pixel, along both rows and columns, without introducing
any aliasing. This decimated image is G_{i+1}. It can be expanded
to twice its current size by an EXPAND operation described later in
this section. If we subtract the EXPANDed version of G_{i+1} from
G_i, what happens in terms of the frequency content is displayed in
Figure 7b(iii). It is easy to see that the resultant image is
nothing but L_i, a band-pass-filtered image depicting the edges.
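
The construction of G_{i+1} and L_i from G_i described above can be
sketched in a few lines of NumPy. This is only a minimal sketch under
stated assumptions, not the PIPE implementation: the 5-tap kernel
[1, 4, 6, 4, 1]/16 is a generating kernel commonly used with Burt's
pyramid and is assumed here, EXPAND is approximated by zero insertion
followed by the same separable filter, borders are handled by edge
replication (so values near the border are approximate), and all
function names are hypothetical.

```python
import numpy as np

# Assumed 5-tap generating kernel (a common choice for Burt's pyramid).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def _filter_axis(img, axis):
    """Convolve along one axis with KERNEL, using edge-replicated borders."""
    pad = len(KERNEL) // 2
    widths = [(0, 0), (0, 0)]
    widths[axis] = (pad, pad)
    padded = np.pad(img, widths, mode="edge")
    return np.apply_along_axis(
        lambda v: np.convolve(v, KERNEL, mode="valid"), axis, padded)

def blur_h(img):
    """H: low-pass filter along the horizontal direction."""
    return _filter_axis(img, axis=1)

def blur_v(img):
    """V: low-pass filter along the vertical direction."""
    return _filter_axis(img, axis=0)

def reduce_level(g):
    """G_{i+1}: filter G_i with H and V, then keep every other pixel."""
    return blur_v(blur_h(g))[::2, ::2]

def expand_level(g, shape):
    """EXPAND: insert zeros to reach `shape`, then interpolate with H and V."""
    up = np.zeros(shape, dtype=float)
    up[::2, ::2] = g
    return 4.0 * blur_v(blur_h(up))  # factor 4 restores the mean level

def laplacian_level(g):
    """Return (L_i, G_{i+1}) for a given G_i."""
    g_next = reduce_level(g)
    return g - expand_level(g_next, g.shape), g_next
```

Calling `laplacian_level` repeatedly on the returned G_{i+1} yields
successive levels of both pyramids.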

A qualitative description of the extension proposed for
orientation selectivity can be easily visualized from Figure 8.
Basically, L_i is convolved with H, V and (1-H-V) respectively. The
results are shown in Figures 8b, 8c and 8d respectively. It is
apparent that these three images depict the spatial edge features
oriented in the horizontal, vertical and diagonal directions
respectively. These three images thus comprise Level i of the POE.
The three images at Level i of the POE, which we denote by
LH_i, LV_i and LD_i (standing for images depicting horizontal,
vertical and diagonal intensity gradients respectively), can be
quantitatively described as

LH_i = L_i * H

LV_i = L_i * V

LD_i = L_i - L_i * H - L_i * V
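
The three channel equations translate directly into code. The sketch
below is illustrative only: the 1-D kernel [1, 4, 6, 4, 1]/16 standing
in for H and V, the edge-replicated border handling, and the names
`lowpass` and `poe_level` are all assumptions rather than details
taken from the implementation described in this paper.

```python
import numpy as np

# Assumed Gaussian-like taps used for both H and V.
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def lowpass(img, axis):
    """1-D low-pass filter along `axis` (1 = horizontal H, 0 = vertical V)."""
    pad = len(KERNEL) // 2
    widths = [(0, 0), (0, 0)]
    widths[axis] = (pad, pad)
    padded = np.pad(img, widths, mode="edge")
    return np.apply_along_axis(
        lambda v: np.convolve(v, KERNEL, mode="valid"), axis, padded)

def poe_level(l_i):
    """Split a Laplacian level L_i into the three POE channels."""
    lh = lowpass(l_i, axis=1)   # LH_i = L_i * H
    lv = lowpass(l_i, axis=0)   # LV_i = L_i * V
    ld = l_i - lh - lv          # LD_i = L_i - L_i * H - L_i * V
    return lh, lv, ld
```

By construction the three channels sum back to L_i, and a pattern
that is constant along each row passes through the H channel
unchanged, since smoothing along the rows does not alter it.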

The extension of this model to a general scheme with K channels
instead of the three described here is omitted for brevity. The
interested reader is referred to [SR87] for details.

However, the extension of the 2-D POE to three dimensions,
to include temporal processing, is worth mentioning. We have
implemented this scheme in a space-time separable fashion. Each
image of the image sequence undergoes spatial band-pass filtering
to give corresponding image sequences that depict oriented
edge features. Each of these sequences is then convolved with
a difference of temporal Gaussians to get the temporal band-pass
effect. The net effect of these two processing steps is
a family of spatiotemporal band-pass filters that are sensitive
to oriented edges in space-time or, equivalently, that are motion
sensitive. This is schematically depicted in Figure 9. This
seemingly straightforward extension of the POE to include the
temporal dimension is computationally very demanding.
Specifically, doing it in real time calls for very careful
synchronization in addition to the increased number crunching. This
aspect will be discussed later along with other issues of
real-time implementation.
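
The temporal half of this space-time separable scheme can be sketched
as a 1-D filter applied to every pixel along the time axis. The
following is a minimal illustration, not the PIPE implementation: the
two standard deviations, the kernel radius, and the function names are
hypothetical choices made only for the example.

```python
import numpy as np

def gaussian_taps(sigma, radius):
    """Normalized 1-D Gaussian sampled at integer frame offsets."""
    t = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-0.5 * (t / sigma) ** 2)
    return g / g.sum()

def temporal_dog(sigma_narrow=1.0, sigma_wide=2.0, radius=4):
    """Difference of two temporal Gaussians: a kernel whose taps sum to zero."""
    return gaussian_taps(sigma_narrow, radius) - gaussian_taps(sigma_wide, radius)

def temporal_bandpass(frames, kernel):
    """Filter a (time, rows, cols) stack along the time axis, pixel by pixel."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 0, frames)
```

Because the kernel taps sum to zero, a stationary scene produces a
zero response, so only pixels whose spatially filtered value changes
over time survive the temporal filter.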
4  Doing  it  in  Real  Time

Most practical applications of a computational model for
motion perception, such as robotics and object tracking in space
applications, inherently call for a real-time implementation
of the model. This aspect has largely been ignored in past
research. We have investigated the underlying issues and
implemented our model in real time on the PIPE system. It is
apparent from the foregoing discussion that the computations
involved in the models described above are very local
in both space and time. It is therefore worthwhile reinterpreting
the basic definitions to develop pixel-parallel versions
that can then be mapped onto a variety of state-of-the-art image
processing architectures that support (i) parallel execution of
simple operations on all the pixels in the image and (ii)
concurrent temporal processing.

We first briefly review one such architecture, namely PIPE,
on which we have implemented the abovementioned models in
real time. We then briefly discuss the approach to developing
the “pixel-parallel” version of Burt’s pyramid and its extension
to the POE, and contrast it with that of Mallat’s pyramid³. For
quick reference, the algorithm to construct Mallat’s pyramid is
shown pictorially in Figure 10. The interested reader is referred
to Mallat’s original work for details.

More importantly, we show how to include temporal processing
in the pixel-parallel versions of the POE using the space-time
separable strategy described before. The pixel-parallel versions
map directly onto the PIPE architecture. We then show
the results of the PIPE implementation. The results are also
available on a video-tape that displays their real-time nature.


4.1  A  Review  of  the  PIPE  Architecture 

PIPE is a multi-stage parallel processor designed to process
images at video rate. Each stage in the system, called a Modular
Processing Stage (MPS), is designed so that all input, processing
and output are completely synchronous with the video raster, which
allows a complete image to be treated as one data structure. A
schematic diagram of the architecture is shown in Figure 9.
Figure 9a shows the connectivity of the eight stages. There is a
forward path connecting the output of each stage to the input of
the next stage, a backward path connecting the output of each stage
to the input of the previous stage, and a recursive path connecting
the output of each stage to its own input. In addition, there are
six video buses to connect the output of any stage to the input of
any other. Each of these data paths is eight bits wide. Images can
be made to stream between stages, spending one cycle (1/60th of a
second) in each stage for processing. Figure 9b shows the capability
available within each stage. The hardware modules available in
each stage include (i) two image buffers, 256x256x8 bits each,
for storing images, (ii) two neighborhood operators to do any
arbitrary 3x3 or 9x1 convolution on the complete image, (iii) four
look-up table operators to do any arbitrary point transformation
on the complete image, such as multiplying each pixel
in the image by 2, (iv) three ALUs to do simple operations
on two images, such as subtracting one image from the other
pixel by pixel, and (v) a Two Valued Function (TVF) module,
a very powerful tool for performing any arbitrary operation on
two images or arbitrary image warping operations, such as
rotating an image by an arbitrary angle. It should be emphasized
that all the operations in a stage are applied to the complete
image and finish within 1/60th of a second.


4.2  Pixel  Parallel  Versions  of  the  Proposed  Models

The key issue in developing pixel-parallel versions is to exploit
the spatial and temporal parallelism inherent in the computational
model. For brevity, we do not describe the complete algorithms
here and only summarize the salient points.
The interested reader is referred to [Sin87a].

The spatial operations used in the model described above
comprise convolutions over small spatial neighborhoods, a fixed
unary operation on all pixels of an image, a fixed binary operation
on all corresponding pixels of two images, and subsampling
to reduce image size or pixel replication to expand it.
They point to an SIMD approach to spatial parallelism, which the
PIPE supports as described before. The temporal filtering required
to make the POE channels motion sensitive, described earlier, is
supported very well by the PIPE’s interstage connectivity. The
spatial processing can be carried out in three successive stages,
and the three consecutive spatially filtered images can then be
combined via the forward, backward and recursive paths.
This concurrency is one of the key aspects of the real-time
implementation.


4.3  Implementation  Results 

The  algorithms  described  above  were  implemented  on  the  Eight- 
Stage  PIPE.  The  results  of  the  implementation  are  discussed 
below. 

For Burt’s pyramid and the POE, the current implementation
computes three levels of the Gaussian pyramid, two levels
of the Laplacian pyramid and one level of the 2-D POE.
It takes four cycles of the PIPE to do all these
computations. So a new image is sampled every 1/15th of
a second, i.e., at one fourth of the standard video rate, and the
abovementioned levels of the pyramid for an image are available
just before the next image is ready to be sampled. The results
are shown in Figure 11a.


³ It is noteworthy that Mallat’s pyramid [Mal87] is functionally identical
to the spatial POE with three channels described above. It, however, uses
quadrature mirror filter (QMF) pairs, which are computationally much
more expensive than the simple Gaussian filter used in the POE. QMF
pairs, however, provide very good “tuning”, and hence their 3-D extensions
to include the temporal dimension have potential application to optic-flow
computation. Heeger [Hee87a] considers them a possible replacement for
the Gabor filters he uses in his model. We are currently investigating the
issue of extending Mallat’s pyramid to three dimensions, in a fashion similar
to the 3-D POE, whose implementation has been discussed above.
We have also separately implemented one level of the 3-D
spatiotemporal POE following the strategy of Figure 9. This
computes optic-flow by generating three image sequences sensitive
to image motion in the horizontal, vertical and diagonal
directions respectively. One frame from the incoming image sequence
and one frame from the output sequence, depicting the image
velocity, are shown in Figure 11c. The scene consisted of the
UTAH-MIT dextrous hand in our laboratory. Only two of the
three fingers visible in the input image (the top and bottom
fingers) were set in motion; the middle finger was kept stationary.
This is apparent in the output image, where only the contours of
the top and bottom fingers depict motion. This computation takes
three PIPE cycles, and thus a new optic-flow result is available
every 1/20th of a second. These speeds show promise for real-time
object tracking in robotics. We are currently working on increasing
the number of direction channels in the optic-flow computation
beyond the three in the current version.

For Mallat’s pyramid we have implemented a version that
constructs one level of the pyramid. We used 9x1 low-pass
and band-pass QMFs. Since a 9x1 convolution is currently not
available on the PIPE, it was implemented using the 3x3 masks
by the parallel technique described in [Sin87b]. This consumes a
lot of PIPE cycles, and hence the current version uses 20 cycles.
Thus a new image is sampled every one third of a second and
the results are available just before the next image is ready to
be sampled. However, it has been calculated that with the 9x1
convolution mask available in hardware, the same computation
can be done in 6 PIPE cycles, i.e., a new result is available every
one tenth of a second. Figure 11b shows the results for one level
of Mallat’s pyramid.
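
The update intervals quoted in this section all follow from the
1/60th-of-a-second PIPE cycle. The short sketch below simply redoes
that arithmetic with exact fractions; the function name and the labels
are illustrative only.

```python
from fractions import Fraction

PIPE_CYCLE = Fraction(1, 60)  # one PIPE cycle lasts 1/60th of a second

def result_interval(cycles):
    """Seconds between successive results when each result needs `cycles` cycles."""
    return cycles * PIPE_CYCLE

for name, cycles in [("Burt's pyramid and 2-D POE", 4),
                     ("3-D POE", 3),
                     ("Mallat's pyramid, 3x3 masks", 20),
                     ("Mallat's pyramid, 9x1 mask in hardware", 6)]:
    print(f"{name}: {cycles} cycles -> {result_interval(cycles)} s")
```

The four cases give intervals of 1/15, 1/20, 1/3 and 1/10 of a
second respectively, matching the figures stated in the text.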

All these results are available on a video-tape to show the
real-time nature of the results.


5  Conclusion 

In this paper we have made a strong case for hierarchical image
representations and processing as an effective tool for robust
optic-flow measurements. We have identified two classes
of computational models in which they can be applied to compute
optic-flow from a sequence of images. After examining this
observation in light of past research, we have described a
new hierarchical image structure, the pyramid of oriented edges,
that can be used to generate a family of motion sensitive filters.
Motion sensitive filters, or spatiotemporal orientation selective
filters, have been identified in recent research as the basic
hardware for optic-flow computation based on the spatiotemporal
frequency approach. We have also emphasized the need to implement
the basic structures required for optic-flow computation in
real time. For this purpose, we have developed pixel-parallel
versions of two hierarchical schemes, namely our own POE and the
functionally similar Mallat’s pyramid, and compared their time
performance via real-time implementation on the PIPE
architecture. We have implemented the 3-D version of the POE to
compute a coarse form of optic-flow. As a prerequisite to the
POE, we have also implemented the pixel-parallel versions of
Burt’s  pyramid.  It  is  noteworthy  that  this  implementation  is 
useful not only for our spatiotemporal frequency based model,
but can also serve as a “core” for the real-time implementation
of optic-flow computation schemes based on intensity correlation
methods such as that of Anandan.

This work is a part of our ongoing effort to get very robust
and fast measurements of optic-flow. Our plans for the near
future include (i) developing a scheme along the lines of Figure 5b
to extract the true optic flow from the motion sensitive filters
obtained with the POE, (ii) implementing two different optic-flow
schemes, namely an intensity correlation based model similar to
that of Anandan and our own POE based model, around the
real-time Burt’s pyramid and the POE respectively, and finally
(iii) integrating the above two models to get a robust optic-flow
scheme. Intensity correlation methods tend to work very well
in “structured scenes” whereas spatiotemporal frequency based
methods are more suited to textured scenes. Their integration
can, therefore, be a potential source of robustness over a wide
range of scenes.
ABSTRACT

Motion  in  the  environment  manifests  itself  in  changes  of 
many  kinds,  not  just  image  plane  velocities,  yet  these  are  all 
that  an  optical  flow  field  makes  explicit.  We  view  optical  flow  as 
a  useful  low-level  representation  from  which  a  symbolic  descrip¬ 
tion  of  change,  in  the  form  of  token  matches,  can  be  computed. 
The  tokens  of  interest  to  us  are  those  produced  by  perceptual 
organization  processes,  and  are  more  abstract  than  edges  or  in¬ 
terest  points.  We  demonstrate  a  working  system  for  matching 
line tokens which uses the optical flow field in a heuristic manner
to limit the search for the minimal bipartite cover of the set of
tokens from each frame.


1.  INTRODUCTION 

It  is  our  position  that  the  inherently  local  measurement  of 
visual  motion  provided  by  optical  flow  is  insufficient  to  meet  the 
varied  requirements  of  dynamic  image  understanding.  We  choose 
to  describe  the  time  varying  image  by  computing  correspondence 
between  tokens  of  arbitrary  spatial  scale  produced  by  perceptual 
organization processes. We believe that this will not only result
in more accurate measurement of visual motion, but also
facilitate the use of motion information in object recognition and
scene understanding.

For  the  purposes  of  this  paper,  techniques  for  measuring  vi¬ 
sual  motion  can  be  roughly  categorized  as  either  optical  flow 
methods  or  token  matching  methods.  This  taxonomy  is  based 
upon  the  nature  of  the  output  representation.  Specifically,  a  dis¬ 
tinction  is  drawn  between  methods  whose  purpose  is  the  com¬ 
putation  of  a  velocity  or  displacement  field,  and  methods  which 
compute  correspondence  in  time  between  tokens  that  serve  as 
descriptors  of  spatial  structure. 

The  goal  of  the  optical  flow  methods  is  to  compute  a  vec¬ 
tor  function  of  the  image  plane.  Depending  on  the  particular 
method,  each  vector  represents  either  a  velocity  or  a  displace¬ 
ment.  The  typically  pixel-parallel  nature  of  the  computation  is 
dictated  largely  by  the  form  of  the  input  and  output  representa¬ 
tions,  which  are  arrays  of  values  in  registration  with  the  original 
scene. Optical flow methods have been more successful than
token matching methods, and much of this success is due to the
fact that these methods implicitly exploit information carried
directly in the “shape” of the image intensity surface. This is
usually effected through the use of an intensity constancy
constraint (e.g., see Horn and Schunck [8]). The utility of optical
flow methods has been further increased by the ease with which they
can be expressed hierarchically, resulting in faster algorithms
capable of describing larger motions [1,13,5].

The token matching methods which concern us compute
correspondence in time between spatial structures produced by
grouping processes. These tokens belong to what Marr has called the
full primal sketch [10], and are more abstract than edge segments
[11] or interest points [2,12,15]. Tokens map directly to
environmental structure, and descriptions of their movement
correlate more closely with the motion of physical objects than does
optical flow. Most importantly, token matching allows change
through time to be expressed in a wide variety of ways. A token
match represents more than a spatial displacement; also explicit
are the changing values of any parameters associated with the
token. These can include orientation, length, area, contrast,
color, etc. In short, motion in the environment manifests itself
in changes of many kinds, not just as image plane velocities.
Information of this kind can facilitate interpretation, including
computation of environmental depth (see [18] in these
proceedings). While this information is explicit in the output of a
token matching method, it is difficult or impossible to compute from
an optical flow field.

While spatial structure is better described by a set of tokens
than by an array of values, current perceptual organization
processes fail to adequately describe the shape of the image
intensity surface. Indeed, this information is usually intentionally
discarded during the abstraction process. However, even if a
sufficiently powerful descriptive language were developed (e.g.,
[3,6]), and a token matching approach were formulated, it is
difficult to see how such an approach could rival the efficiency and
simplicity of methods which exploit this information implicitly,
through an intensity constancy constraint.

The fact that a representation is easy to compute reveals
nothing about its utility. Interpretation of an optical flow field
independent of any knowledge of the spatial structure from which
it was derived seems difficult at best. Spatial structure can be
characterized locally with an interest operator [12], and
interpretation can be restricted to the sparse set of points with a
high interest operator score. Alternatively, local structural
measures can be incorporated within the optical flow computation
itself, allowing a dense flow field to be computed through a
smoothness constraint. Nagel employs a second order approximation
of the intensity variation to determine the direction in which a
smoothness constraint is enforced [U]. Anandan analyzes the
principal curvatures of the sum-of-squared-difference surface to
associate vector confidence measures with each displacement,
which in turn indirectly influence the enforcement of a smoothness
constraint [1]. The characterizations of spatial structure in
the techniques of Anandan and Nagel are simple and local, as
is necessitated by the need to incorporate such characterizations
within the formulations of the smoothness constraints.

Recent work in perceptual organization has given us a richer
vocabulary with which to describe spatial structure than has
been available in the past [9,10,16,19]. While local operators
are useful for detecting local structures, more powerful grouping
processes are required to recognize structure of arbitrary scale.
Since the image structure revealed through grouping corresponds
directly with environmental structure [19], perceptual
organization provides the best measure of “interest.” Recently,
Boldt has incorporated these ideas in a working program for
grouping long straight line segments [4,17].

With this in mind, a logical course of action might be to
enforce a smoothness constraint along the length of a line segment
produced by a perceptual organization process such as Boldt’s.
In fact, Hildreth’s smoothness formulation can be enforced along
an arbitrary contour, and for the specific case of a line segment
in three space undergoing rigid motion, she demonstrates that
it yields the physically correct flow [7]. If our goal were to
produce a better flow field, this is exactly what we would do, but
it has already been suggested that knowledge of correspondence
in time between the line segments themselves would result in a
representation more useful to the interpretation task. We choose
to view the optical flow field as a useful low-level representation
from which a symbolic description of change, in the form of token
matches, can be computed.

2.  GROUPING  FAILURE 

As tokens become more abstract, they also become more
distinctive. It is therefore often assumed that the more abstract
a token is, the less ambiguous matching will be. To a certain
extent this is true, but increased abstraction brings with it a
new source of ambiguity. Under ideal conditions, we might
expect two perceptual organization processes operating
independently on two frames of a motion sequence to partition each
frame into the same set of tokens. After all, each frame is a
slightly different view of a predominantly stable physical world,
separated only slightly in time. Unfortunately, in practice, due
to the discrete sampling of image formation (and other image
effects such as shadows, highlights, and texture), the likelihood
of two frames being partitioned into the same set of tokens is
very small. Since the basic operation employed in perceptual
organization is grouping, discrepancies can arise in two ways: 1)
failure to relate two tokens that have a single physical cause, or
undergrouping, and 2) mistakenly relating two tokens that have
separate physical causes, or overgrouping. While both
undergrouping and overgrouping can be caused by noise, overgrouping
errors most often occur when two tokens satisfy the geometric
criteria for grouping through pure chance. Unfortunately, this
happens more frequently in motion sequences depicting the view
of the world from the vantage point of a moving sensor. While
the odds of two unrelated tokens accidentally satisfying the
grouping criteria in any single view are small, the odds of the
moving sensor passing through such degenerate views in the course
of a motion sequence are much higher. The solution to this problem
is beyond the scope of a simple, two-frame matching approach,
and it will be discussed in greater detail when suggestions for a
multiple frame approach are presented later in the paper.


Even relatively simple undergrouping errors rule out the
possibility of a one-to-one mapping between tokens from successive
frames, since the number of tokens in each frame will rarely be
the same. Under these circumstances, it seems that the best
mapping possible is one that assigns each token at least
one match and optimizes some error function in the process.
Such a mapping is called a minimal bipartite cover. We first
encountered the minimal bipartite cover, in the context of the
correspondence problem, as part of Ullman’s minimal mapping
theory [20]. Interestingly, the motivation Ullman gives for its use
is unrelated to the grouping failure argument presented here. We
believe that the minimal bipartite cover is simply a more
practical goal than a one-to-one mapping when matching abstract
tokens prone to grouping errors.
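
To make the notion of a minimal bipartite cover concrete, the sketch
below finds one by exhaustive search for a tiny example. It is purely
illustrative and is not the method of this paper or of Ullman: the
cost table is hypothetical, costs are assumed non-negative, and brute
force is only feasible for a handful of tokens.

```python
from itertools import combinations

def minimal_cover(cost):
    """Brute-force minimal bipartite cover.

    cost[i][j] is a hypothetical non-negative cost of matching token i
    of the first frame with token j of the second frame. Returns
    (total_cost, matches), where every token of both frames appears in
    at least one match.
    """
    n, m = len(cost), len(cost[0])
    edges = [(i, j) for i in range(n) for j in range(m)]
    best = None
    # A cover needs at least max(n, m) matches; with non-negative costs
    # an optimal cover never needs more than n + m - 1.
    for k in range(max(n, m), n + m):
        for subset in combinations(edges, k):
            covers_first = {i for i, _ in subset} == set(range(n))
            covers_second = {j for _, j in subset} == set(range(m))
            if covers_first and covers_second:
                total = sum(cost[i][j] for i, j in subset)
                if best is None or total < best[0]:
                    best = (total, list(subset))
    return best

# Two tokens in the first frame, three in the second (for example, one
# line that was split into two pieces by an undergrouping error).
example_cost = [[1, 5, 6],
                [7, 2, 1]]
total, matches = minimal_cover(example_cost)
```

In this example the second token of the first frame receives two
matches, which a one-to-one mapping could not express.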

3.  FROM  OPTICAL  FLOW  TO  TOKEN  MATCHES 

Ullman’s minimal mapping theory, which presents the
correspondence problem as an optimization problem in the abstract,
is a very general paradigm, and it serves as a useful point of
departure for the discussion of the method proposed in this paper.
In the minimal mapping theory, correspondence is computed
between tokens from two frames by finding the minimal bipartite
cover of the graph whose nodes are the tokens from each frame
and whose arcs reflect potential correspondence. The weight of
each arc in the graph is called an affinity measure, and is a
function of the relative similarity and spatial separation of the
two tokens which the arc links. Ullman justifies his choice of
particular affinity values with data from studies of the human
visual system. Indeed, the minimal mapping theory is offered as a
possible explanation for the manner in which many of the classic
Gestalt displays such as Ternus’ configuration are interpreted by
the human visual system. Because of the explosively large number
of possible mappings for matching problems of even modest
size, Ullman simplifies the general matching problem by assuming
that the number of candidate matches that each token can
claim is equal to some small integer constant. He then shows
that, under this assumption, the optimization problem can be
solved by a hill climbing process, which leads to a relaxation
algorithm. However, no method for choosing initial candidate
matches is offered, and the number of iterations required for
convergence is unclear.

The approach described in this paper reflects a natural
synthesis of the optical flow and token matching paradigms. Because
the optical flow field is a vector function of the image plane, it
can be used to define a transformation that maps tokens from
one frame to their predicted positions in the next frame. The
spatial area that must be searched for a potential match is
reduced to a small region surrounding the token’s predicted
position. Furthermore, the merit of a potential match is judged by
its proximity to its predicted position, not to its previous
position. This simplifies the general matching problem by
restricting the number of candidate matches.
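
The idea of using the flow field to predict where a token should
reappear, and searching only near that prediction, can be sketched as
follows. This is a simplified illustration under stated assumptions:
each token is reduced to a single (x, y) reference point, the flow is
stored as a (rows, cols, 2) array of (dx, dy) displacements, and the
function names and search radius are hypothetical.

```python
import numpy as np

def predict_position(point, flow):
    """Displace an (x, y) point by the flow vector stored at its pixel."""
    x, y = point
    dx, dy = flow[int(round(y)), int(round(x))]
    return np.array([x + dx, y + dy])

def candidate_matches(token, next_tokens, flow, radius):
    """Return (index, distance) pairs for next-frame tokens that lie within
    `radius` of the token's predicted position, closest first."""
    predicted = predict_position(token, flow)
    dists = [float(np.linalg.norm(np.asarray(t, dtype=float) - predicted))
             for t in next_tokens]
    hits = [(i, d) for i, d in enumerate(dists) if d <= radius]
    return sorted(hits, key=lambda h: h[1])
```

Note that distances are measured from the predicted position, so a
token that has not moved at all is rejected when the flow field
indicates a displacement larger than the search radius.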

4.  AN  IMPLEMENTATION  USING  LINE  TOKENS 

In this section, we describe an implementation of the general
approach just outlined. A schematic view of the implementation
is depicted in Figure 1.

In  a  preliminary  step,  a  set  of  line  tokens  is  computed  for 
each frame using Boldt’s line grouping algorithm (Figures 2 and
3). Boldt’s algorithm begins by extracting an initial set of line
segments whose orientation is normal to the gradient direction
along zero crossing contours of the Laplacian operator.
These initial line segments form the nodes of a graph whose arcs
(links in Boldt’s terminology) reflect a significant non-accidental
geometric relationship between the two line segments they join.
Some  of  the  relations  used  as  linking  criteria  are  endpoint  prox¬ 
imity,  orientation  difference,  lateral  distance,  overlap  and  con¬ 
trast  difference.  All  paths  through  the  link  graph  within  the 
current  replacement  radius  are  examined  and  the  path  minimiz¬ 
ing  the  mean-square-error  of  a  straight  line  fit  is  replaced  by  a 
new  line  segment.  The  program  is  then  invoked  recursively  on 
the  new  set  of  line  segments,  using  a  larger  replacement  radius, 
resulting  in  ever  smaller  sets  of  increasingly  longer  lines.  A  final 
set of between one and two hundred lines is produced by filtering
on  length  and  contrast. 

In a second preliminary step, the optical flow field is computed
using the method developed by Anandan [1] (Figure 4).
Strictly  speaking,  Anandan’s  algorithm  produces  a  displacement 
field,  not  a  flow  field.  The  intensity  constancy  constraint  ex¬ 
ists  implicitly  as  a  sum-of-squared-difference  measure  within  a 
Laplacian pyramid. Anandan’s algorithm uses knowledge of the
direction of principal curvature of the sum-of-squared-difference
surface to enforce a smoothness constraint at each level of the
Laplacian pyramid. These design choices together comprise a
working  system  that  appears  to  consistently  yield  reliable  esti¬ 
mates  of  image  displacements. 

The  predicted  position  for  each  line  from  the  first  frame  is 
computed  by  a  least  squares  fit  to  the  points  comprising  the  im¬ 
age  of  that  line  under  the  transformation  defined  by  the  optical 
flow  field.  All  lines  from  the  second  frame  passing  through  a  nar¬ 
row  rectangular  region  surrounding  this  predicted  position  are 
retrieved  (Figure  5).  The  size  of  the  search  region  is  a  parameter 
of  the  system.  Although  currently  a  constant,  it  could  conceiv¬ 
ably  be  coupled  to  the  value  of  confidence  measures  associated 
with the optical flow, such as those computed by Anandan’s algorithm.
In this way, the window size would be smaller in areas
of  high  confidence  and  larger  in  areas  of  low  confidence. 
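
The prediction and retrieval step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the flow field is assumed to be available as a callable `flow(x, y)` returning a displacement, the fit is done by total least squares, and the names `predict_line`, `passes_through`, `samples` and `half_width` are all hypothetical.

```python
import math

def predict_line(p0, p1, flow, samples=10):
    """Displace sample points of segment p0-p1 by flow(x, y) -> (dx, dy) and
    fit a line to the displaced points (total least squares).
    Returns (centre, unit direction, half length) of the predicted segment."""
    pts = []
    for k in range(samples + 1):
        t = k / samples
        x = p0[0] + t * (p1[0] - p0[0])
        y = p0[1] + t * (p1[1] - p0[1])
        dx, dy = flow(x, y)
        pts.append((x + dx, y + dy))
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    sxx = sum((x - cx) ** 2 for x, _ in pts)
    syy = sum((y - cy) ** 2 for _, y in pts)
    sxy = sum((x - cx) * (y - cy) for x, y in pts)
    # Orientation of the principal axis of the point scatter.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    d = (math.cos(theta), math.sin(theta))
    proj = [(x - cx) * d[0] + (y - cy) * d[1] for x, y in pts]
    mid = 0.5 * (max(proj) + min(proj))
    half_len = 0.5 * (max(proj) - min(proj))
    centre = (cx + mid * d[0], cy + mid * d[1])
    return centre, d, half_len

def passes_through(seg, centre, d, half_len, half_width):
    """True if segment seg = (q0, q1) intersects the narrow rectangle of
    half-extent (half_len, half_width) aligned with direction d."""
    def local(q):
        rx, ry = q[0] - centre[0], q[1] - centre[1]
        return rx * d[0] + ry * d[1], -rx * d[1] + ry * d[0]
    (u0, v0), (u1, v1) = local(seg[0]), local(seg[1])
    t_min, t_max = 0.0, 1.0
    # Clip the segment against the two slabs |u| <= half_len, |v| <= half_width.
    for start, delta, bound in ((u0, u1 - u0, half_len), (v0, v1 - v0, half_width)):
        if delta == 0:
            if abs(start) > bound:
                return False
        else:
            ta = (-bound - start) / delta
            tb = (bound - start) / delta
            t_min = max(t_min, min(ta, tb))
            t_max = min(t_max, max(ta, tb))
            if t_min > t_max:
                return False
    return True
```

Coupling the window to a flow confidence measure, as suggested above, would amount to making `half_width` a function of that confidence rather than a constant.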

A bipartite graph, henceforward called the time-link graph, is
constructed. Its arcs connect line segments from the frame one
token set to all candidate matches retrieved from the second
frame (Figure 6). The weight of each arc in the time-link graph
is a measure of the discrepancy in position between the predicted
position of the frame one line segment and the position
of the candidate match. Since the line segments' lengths are
highly unstable (because of undergrouping errors), information
about length is not incorporated into the positional discrepancy
metric. Instead, the measure approximates the average distance
between the two segments, independent of their lengths (Figure
7). Ideally, one would use a measure similar to that employed
by Lowe in his model matching system [9], which computes the
probability of the juxtaposition of two line segments being due to
chance alone, using knowledge of the distribution of background
line segments.
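
One way to obtain a length-independent measure of this kind is sketched below. The exact formula used by the system is not given in the text, so this is only an assumed approximation: it averages the perpendicular distances from each segment's endpoints to the infinite line through the other segment. The function names are hypothetical.

```python
import math

def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the infinite line through a and b."""
    (ax, ay), (bx, by) = a, b
    length = math.hypot(bx - ax, by - ay)
    return abs((bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)) / length

def positional_discrepancy(predicted, candidate):
    """Average endpoint-to-line distance, taken in both directions.
    Each segment is a pair of (x, y) endpoints."""
    d = [point_line_distance(q, *predicted) for q in candidate]
    d += [point_line_distance(q, *candidate) for q in predicted]
    return sum(d) / len(d)
```

Because only distances to infinite lines are used, a short fragment of a line receives the same score as the full line, which is the property needed when undergrouping makes lengths unreliable. Displacement along the line is ignored here; in the system described, the search region already bounds it.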

By  computing  the  connected-components  of  the  time-link 
graph,  the  global  matching  problem  is  conveniently  divided  into 
smaller, individually tractable pieces which reflect the scope of
potential interactions. For each connected-component, the bipartite
cover minimizing the positional discrepancy metric is
found. This is accomplished through a simple blind search of
the sub-graphs of each connected-component. Although Ullman
suggests solving the optimization problem through network relaxation,
the need for such an approach is eliminated here because
of the relatively small size of each connected-component.
Indeed, a connected-component often contains only a single arc,
in which case the match is uniquely determined. This is directly
due to the heuristic use of the optical flow field. The bipartite
cover reflects the final correspondences reported by the system
(Figure 8).

Figure 5. The rectangular search regions computed for each
line token by a least squares fit to the tips of the displacement
vectors. Lines from frame two that intersect the search region
become candidate matches.

Figure 6. The time-link graph, where the arcs connect lines
in frame one with their candidate matches.
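
The decomposition and blind search can be sketched as follows. The sketch assumes that "bipartite cover" means a subset of arcs that touches every token in the component (an edge cover) and that its cost is the sum of the arc weights; both readings, and all names, are assumptions rather than details confirmed by the text.

```python
from itertools import combinations

def connected_components(arcs):
    """arcs: dict {(a, b): weight}, a a frame-one token, b a frame-two token.
    Returns a list of arc lists, one per connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in arcs:
        parent[find(("A", a))] = find(("B", b))
    groups = {}
    for a, b in arcs:
        groups.setdefault(find(("A", a)), []).append((a, b))
    return list(groups.values())

def minimal_cover(component, arcs):
    """Blind search over all arc subsets of one component: return the
    cheapest subset that touches every token of the component."""
    left = {a for a, _ in component}
    right = {b for _, b in component}
    best, best_cost = None, float("inf")
    for k in range(1, len(component) + 1):
        for subset in combinations(component, k):
            covers = ({a for a, _ in subset} == left
                      and {b for _, b in subset} == right)
            if covers:
                cost = sum(arcs[e] for e in subset)
                if cost < best_cost:
                    best, best_cost = list(subset), cost
    return best, best_cost

# Hypothetical time-link graph: one three-arc component, one single-arc component.
arcs = {("a1", "b1"): 1.0, ("a1", "b2"): 4.0, ("a2", "b2"): 1.5, ("a3", "b3"): 0.5}
covers = [minimal_cover(c, arcs) for c in connected_components(arcs)]
```

The search is exponential in the number of arcs per component, which is acceptable only because, as noted above, components are small and often consist of a single arc.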

This  system  has  been  used  to  process  more  than  fifty  dif¬ 
ferent  two-frame  matching  problems  drawn  from  six  different 
multi-frame  sequences.  All  the  sequences  are  composed  of  im¬ 
ages  of  real  scenes  and  contain  more  than  one  hundred  lines 
each. No special attention was paid to the magnitude of the
displacements  between  frames,  and  the  system  seems  reasonably 
robust  to  the  problems  posed  by  undergrouping  errors.  Encour¬ 
aged  by  the  quality  of  the  two-frame  results,  the  system  was  run 
repeatedly on successive frames of a multi-frame sequence, creating
a directed acyclic graph, or dag, representing the splitting
and merging of line segments over time. The results of one such
multi-frame experiment, involving several rotating objects, are
shown in Figures 9-12. Results from a second sequence, taken
by a camera mounted on a mobile robot panning by a stairway,
are shown in Figures 13-16.

Unfortunately,  the  interpretation  of  such  a  representation  is 
non-trivial. For example, one cannot tell from local information
alone whether a particular split or merge in the dag is due to
an undergrouping or overgrouping failure. Although the minimal
bipartite cover functions well when the set of line segments
in each frame is the same (except for fragmentation due to
undergrouping), it performs badly when wholly new line segments
appear or disappear. This can be caused by the initial filtering
operation  used  to  reduce  the  number  of  line  segments  in  each 
frame. 

5.  P.O.  IN  PARAMETER  SPACE 

As  mentioned  before,  overgrouping  occurs  when,  through 
chance,  two  tokens  are  juxtaposed  in  such  a  way  as  to  satisfy 
the requirements for grouping. For example, two coplanar line
segments will appear colinear when viewed from any point in
the plane in which they both lie (i.e. the degenerate view plane).
Such a pair of segments is likely to satisfy the grouping requirements
in an algorithm such as Boldt’s, and will be grouped as a
single line segment; see Figures 17 and 18. The probability of
a moving sensor passing through the degenerate view plane for
some pair of lines is relatively high, especially in a man-made
environment, due to the plethora of horizontal and vertical lines.
However, if we choose to describe each line as a point in the ρ-θ
parameter space, and examine the set of such points through
time, we will find two distinct trajectories that intersect during
the degenerate view. The appropriate solution seems to be to divide
the set of points in parameter space into distinct trajectories
corresponding to separate physical entities. This is a perceptual
organization problem, and the space to be organized is defined
by the parameters of the token. For point tokens, the parameter
space happens to be the image plane, but for line segments,
the most suitable parameters are ρ and θ, which are stable even
when undergrouping errors make the determination of the segment's
endpoints impossible. Solving the problems peculiar to
perceptual organization of line token trajectories in ρ-θ-t will
be our major goal for the next few months. By extending the
"perceptual radius" to a larger number of frames, we hope to
take advantage of good continuity, which will permit matching
in spite of single frame grouping errors.
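
The stability of the line parameters under fragmentation can be illustrated with a short sketch. It assumes the usual normal form of a line, x cos θ + y sin θ = ρ with θ in [0, π); the function name `rho_theta` and this particular convention are illustrative choices, not taken from the text.

```python
import math

def rho_theta(p0, p1):
    """Normal-form parameters (rho, theta) of the infinite line through
    p0 and p1, where x*cos(theta) + y*sin(theta) = rho and 0 <= theta < pi."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    # The line's normal is (-dy, dx); fold its angle into [0, pi).
    theta = math.atan2(dx, -dy) % math.pi
    rho = p0[0] * math.cos(theta) + p0[1] * math.sin(theta)
    return rho, theta

# A full segment and a fragment of it, as undergrouping might produce.
full = rho_theta((0.0, 2.0), (4.0, 2.0))
fragment = rho_theta((1.0, 2.0), (2.0, 2.0))
```

Both calls describe the same point in parameter space even though the endpoints differ, which is why trajectories of such points over time are a candidate space for the organization problem described above. Lines whose θ is close to 0 or π would need wrap-around handling that this sketch omits.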

6.  CONCLUSION 

An  optical  flow  field  is  a  vector  function  of  the  image  plane.  It 
is  a  very  simple  characterization  of  the  changing  intensity  func¬ 
tion  that  results  when  a  dynamic  scene  is  imaged.  Compared 
to  token  matching,  it  is  a  relatively  well  developed  paradigm 
and several different algorithms exist for computing it. This
paper explores the possibility of translating optical flow into token
matches, creating a more abstract representation based on a
directed acyclic graph. The nodes of this graph are tokens corresponding
to spatial structure and the arcs reflect correspondence
between frames. In addition to the spatial displacement
of the token, this representation makes the changing values of
the token's parameters through time explicit. The approach is
demonstrated by a working implementation which uses line tokens.
Finally, it is proposed that the best path to pursue in
future work is perceptual organization in the parameter space
of the token. Hopefully, this will provide increased reliability in
the face of single frame grouping errors.

Figure 7. The positional discrepancy metric approximates
the average distance between the candidate match and the predicted
position of the frame two line.

Figure 8. The minimal bipartite cover for each
connected-component of the time-link graph. In this example, a
single arc has been removed.


Figure 9. The first frame of a motion sequence containing
multiple independently moving objects.


Figure 10. The line tokens computed for the first frame.
Line tokens which will be used to illustrate the output of the
matching process are displayed thick.


Figure 11. The displacement field computed for the first
and second frame of the sequence. Note the rotation of the box
and the soccer ball.


Figure 12. The output of the matching process for selected
lines. This figure was computed by displaying the lines encountered
during a depth first traversal of the directed acyclic graph
representing the token matches. Note the splitting and merging
of the lines outlining the panels of the soccer ball. The changing
orientation of the line tokens is explicitly represented.

Figure 13. The first frame of a motion sequence taken by a
mobile robot panning by a stairway.




Structure  Recognition  by  Connectionist  Relaxation: 
Formal  Analysis 


Paul  Cooper 

Department of Computer Science

University  of  Rochester 
Rochester,  NY  14627 
cooper@cs.rochester.edu 


Abstract 

The  paper  formally  describes  a  connectionist 
implementation  of  discrete  relaxation,  for  labelled 
graph  matching.  The  application  is  fast  parallel 
indexing  from  structure  descriptions.  The  network  is 
limited  by  complexity  considerations  to  the  detection 
and  propagation  of  unary  and  binary  consistency 
constraints. 

The  convergence  of  the  algorithm  is  proved.  The 
desired  behaviour  of  the  algorithm  is  formally 
specified,  and  the  fact  that  the  network  correctly 
computes  this  is  proved.  Explicit  and  exact  space  and 
time  resource  requirements  are  developed. 

1.  Introduction 

This  paper  provides  a  formal  description  and 
analysis  of  a  connectionist  network  whose 
experimental  performance  was  described  earlier 
[Cooper and Hollbach 1987]. The network performs a
limited  kind  of  labelled  graph  matching  by  discrete 
relaxation  (or  constraint  propagation).  It  was  designed 
for  the  indexing  task  in  recognizing  structured  objects. 

The  convergence  of  the  algorithm  is  proved.  The 
desired  behaviour  of  the  algorithm  is  formally 
specified,  and  the  fact  that  the  network  correctly 
computes  this  is  proved.  Explicit  and  exact  space  and 
time  resource  requirements  are  developed.  The 
feasibility  of  the  algorithm  given  its  resource 
requirements  is  discussed. 


2.  Graphs  and  Matchings 

Various  types  of  graphs  and  relational  structures 
are  often  used  to  represent  image  information  (eg. 
[Kitchen  and  Rosenfeld  1979,  Shapiro  and  Haralick 
1981]). In this section, I introduce simple definitions
for  graphs  and  matching.  These  particular  definitions 
were  chosen  to  map  easily  onto  the  corresponding 
connectionist  network  that  is  specified  later. 

Definition  1: 

A labelled graph is a triplet $G = (P, R, L)$, where $P$ is
a set of parts, or more typically, nodes, $R$ is a set of
relations or arcs, and $L$ is a set of node labels. The
parts (elements of the set $P$) are designated $p_i$. The
label associated with $p_i$ is designated $l_i$. For
simplicity, the $l_i$ are taken to be integers between 1
and 10. An arc between nodes $p_e$ and $p_f$ is
designated $r_{ef}$.

Sometimes graphs and their components are
designated with superscripts, to specify which graphs
they belong to: eg. $G^A$, $l_i^A$, $r_{ef}^A$. The rest of the paper
concerns itself primarily with 2 archetypical graphs:

$G^A$ - the candidate object
$G^B$ - the target model

Definition  2: 

A matching is a pair of sets representing a mapping
between 2 labelled graphs. The matching $M^{AB}$
between graphs $G^A$ and $G^B$ is:

$M^{AB} = (M_P^{AB}, M_R^{AB})$

where the elements of each set are pairs that
designate particular pairwise mappings. To
indicate the mapping between $p_i^A$ and $p_j^B$, an
element of $M_P$ is designated $m_{ij}$. The arc-arc
mapping between $r_{ef}^A$ and $r_{gh}^B$ is designated $n_{ef,gh}$.
Note that arc matches have no direction.

If for some $p_i$ there is exactly 1 $m_{ij}$, $p_i$ is said to match
$p_j$. A node match $m_{ij}$ is called a valid match if $l_i^A = l_j^B$.
I often refer to generic elements of $P$, $M_P$ and $M_R$ as
respectively $p_i$, $m_{ij}$, and $n_{ef,gh}$, without complete
quantification.

Note that a matching as defined is completely
unconstrained. Matchings are not directional. It may
be that a matching is not one-to-one. Matchings may
also be inconsistent, where consistency is defined as
follows.

Definition 3:

A node match $m_{ij}$ and arc match $n_{ef,gh}$ are consistent
iff:

$(i = e \lor i = f) \land (j = g \lor j = h)$

(ie. the matched nodes are each one end of the
matched arcs).

An arc match is said to be locally consistent if it has
consistent node matches at each end of the arc.

It is clear that a matching which is both one-to-one
and locally consistent is an isomorphism between $G^A$
and $G^B$.

3. Network Configuration

Next, I specify a connectionist model which can
represent any 2 graphs $G^A$ and $G^B$, and all possible
matchings between them. Needless to say, the model
will also compute locally consistent matches, including
isomorphisms. I adopt the definitions of Feldman and
Ballard [1982] as the formal basis for the connectionist
units.

Connectionist models are inherently finite.
Therefore, the size of the graphs which can be matched
must be bounded a priori. Let some constant $N$ bound
the number of nodes in the graph. Aside from this size
limitation, the model will be universal for any pairwise
graph matching problem.

3.1 Units

I  now  define  sets  of  units  corresponding  to  all 
elements  of  the  2  graphs  GA  and  GB,  and  all  potential 
matchings  MAB.  For  convenience,  I  will  refer  to  the 
individual  connectionist  units  with  the  same 
designators  as  the  graph  and  matchings  elements, 
leaving  context  to  disambiguate.  The  following  are  the 
unit  sets: 

$U(P^A)$ - the units corresponding to the graph nodes of
the candidate object

$U(P^B)$ - likewise for the target model graph

$U(M_P)$ - the node matching units corresponding to the
$m_{ij}$

$U(R^A)$ - the units corresponding to the arcs of the
candidate object

$U(R^B)$ - likewise for the target model graph

$U(M_R)$ - the arc matching units corresponding to the
$n_{ef,gh}$

I call the units $U(P^A)$, $U(P^B)$, $U(R^A)$, and $U(R^B)$
graph units, and the units $U(M_P)$ and $U(M_R)$ matching
units. In particular, the $m_{ij}$ are the node matching units
and the $n_{ef,gh}$ are the arc matching units. The graph
units effectively act as input units to the model,
describing the graphs $G^A$ and $G^B$ for a particular
problem instance.

These sets of units are best conceived of in terms of 2
arrays, as in Figure 1. The left-hand array represents
the nodes of both graphs, and all possible node
matchings. The right-hand array represents the arcs of
the graphs, and all possible arc matchings. This
connectionist representation of graphs and matchings
is a classic example of the connectionist unit/value
principle in action. Lack of an interpreter forces the
network to explicitly represent all bindings between
the 2 graph component sets. If exactly one unit in each
row and column of each array were active, this would
represent an isomorphism between the 2 graphs.

Figure 1: node matching array, arc matching array, and
example constraint propagation links
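
The definitions of Section 2 and the array picture above can be made concrete with a small sketch. It is only an illustration under stated assumptions: node matches are pairs (i, j), arc matches are pairs of undirected arcs ((e, f), (g, h)), and "consistent node matches at each end of the arc" is read as requiring both ends of the candidate arc to be matched to the two ends of the model arc. All function names are hypothetical.

```python
def valid(node_match, labels_a, labels_b):
    """A node match m_ij is valid if the two labels agree."""
    i, j = node_match
    return labels_a[i] == labels_b[j]

def consistent(node_match, arc_match):
    """Definition 3: i is an end of arc ef and j is an end of arc gh."""
    (i, j), ((e, f), (g, h)) = node_match, arc_match
    return i in (e, f) and j in (g, h)

def locally_consistent(arc_match, node_matches):
    """Both ends of arc ef are matched to the two ends of arc gh."""
    (e, f), (g, h) = arc_match
    return ({(e, g), (f, h)} <= node_matches
            or {(e, h), (f, g)} <= node_matches)

def one_to_one(pairs, rows, cols):
    """Exactly one active unit in each row and column of a matching array."""
    return (sorted(r for r, _ in pairs) == sorted(rows)
            and sorted(c for _, c in pairs) == sorted(cols))

def is_isomorphism(node_matches, arc_matches, nodes_a, nodes_b, arcs_a, arcs_b):
    """One-to-one in both arrays, with every arc match locally consistent."""
    return (one_to_one(node_matches, nodes_a, nodes_b)
            and one_to_one(arc_matches, arcs_a, arcs_b)
            and all(locally_consistent(n, node_matches) for n in arc_matches))

# Hypothetical instance: two three-node chains, candidate nodes as numerals
# and model nodes as letters.
nodes_a, nodes_b = [1, 2, 3], ["a", "b", "c"]
arcs_a, arcs_b = [(1, 2), (2, 3)], [("a", "b"), ("b", "c")]
node_matches = {(1, "a"), (2, "b"), (3, "c")}
arc_matches = {((1, 2), ("a", "b")), ((2, 3), ("b", "c"))}
```

Each active pair in `node_matches` or `arc_matches` plays the role of one active unit in the corresponding array.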


3.1.1 Network Complexity: Units

In analyzing the complexity of a connectionist
network, the most important characteristic is the
number of units required for the computation. But the
fixed size of the universal network is clear:

$|U(P^{obj})| = |U(P^{model})| = N$

$|U(M_P)| = N^2$

$|U(R^{obj})| = |U(R^{model})| = \tfrac{1}{2}N(N - 1)$

$|U(M_R)| = [\tfrac{1}{2}N(N - 1)]^2$

So the total number of units is $O(N^4)$.

3.2 Connections

The connections between the units are now
specified. All connections are symmetric, with weights
1.

Graph Input Connections

- all $p_i^A$ connect to all $m_{ij}$ for all $j$

- all $p_j^B$ connect to all $m_{ij}$ for all $i$

- all $r_{ef}^A$ connect to all $n_{ef,gh}$ for all $gh$

- all $r_{gh}^B$ connect to all $n_{ef,gh}$ for all $ef$

These connections allow the representation of the
input graphs to interact with the matching units.

Winner-take-all Connections

Each $m_{ij}$ and $n_{ef,gh}$ is to take part in a winner-take-all
competition (WTA) against the other units in its
row and column. There are a variety of
implementations of WTA nets, any of which is suitable.
For simplicity in describing unit functions later, a
WTA network is used in which each unit compares
itself to the maximum of all the others.

- each $m_{ij}$ is connected to $m_{ik}$ for all $k$ and $m_{kj}$ for
all $k$

- each $n_{ef,gh}$ is connected to $n_{ef,ab}$ for all $ab$ and
$n_{ab,gh}$ for all $ab$

Cross-array Connections

- each $n_{ef,gh}$ connects to exactly 4 $m_{ij}$ as follows:

(i) $i = e$ & $j = g$

(ii) $i = e$ & $j = h$

(iii) $i = f$ & $j = g$

(iv) $i = f$ & $j = h$

These cross-array connections allow each potential arc
match to connect to all 4 consistent node matches.
With the $p_i^A$ designated by numerals, the $p_j^B$
designated by letters, and the $r_{ef}^A$ and $r_{gh}^B$ designated
correspondingly, Figure 1 demonstrates these cross-array
connections.

3.2.1  Network  Complexity:  Connections 

Sometimes  the  number  of  connections  is  relevant  to 
determining  the  feasibility  of  a  network.  Fan-in  and 
fan-out  values  are  relevant  as  well. 

Number  of  Connections 

- there are $2N^2$ graph input connections in the
node matching array

- there are $O(N^4)$ graph input connections in the
arc matching array

- there are $N^2(N - 1)$ WTA connections in the
node matching array

- there are $O(N^6)$ WTA connections in the arc
matching array

- there are $O(N^4)$ cross-array connections

Aside from the WTA network (for which more efficient
implementations exist), there are $O(N^4)$ connections in
the network.
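
The counts above can be checked numerically for a given bound N. The sketch below assumes undirected arcs, so that each graph has at most N(N - 1)/2 arc units; the exact arc-array connection counts are derived here under that assumption (the text states only their order), and the function name `network_size` is hypothetical.

```python
def network_size(n):
    """Unit and connection counts of the universal network for size bound n,
    assuming n*(n-1)/2 undirected arcs per graph."""
    arcs = n * (n - 1) // 2
    units = {
        "graph node units (both graphs)": 2 * n,
        "node matching units": n ** 2,
        "graph arc units (both graphs)": 2 * arcs,
        "arc matching units": arcs ** 2,
    }
    connections = {
        # each of the 2n node units feeds one row or column of n match units
        "graph input, node array": 2 * n ** 2,
        # each of the 2*arcs arc units feeds one row or column of the arc array
        "graph input, arc array": 2 * arcs ** 2,
        # n^2 units with 2(n-1) symmetric WTA links each, counted once
        "WTA, node array": n ** 2 * (n - 1),
        "WTA, arc array": arcs ** 2 * (arcs - 1),
        # every arc matching unit links to exactly 4 node matching units
        "cross-array": 4 * arcs ** 2,
    }
    return units, connections

units, connections = network_size(10)
total_units = sum(units.values())
```

For N = 10 the exact total under these assumptions stays below the N^4 bound of 10,000 units.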


Fan-out

- the fan-out of the graph units is $N$ and $N^2$ for the
node and arc units respectively

- fan-out for the $n_{ef,gh}$ arc matching units is $2(N^2 - 1)$
for the WTA connections, and 4 for the cross-array
connections

- fan-out for the $m_{ij}$ node matching units is $2(N - 1)$
for the WTA connections, and $(N - 1)^2$ for the
cross-array connections.

Alternate WTA configurations have lower fan-out
and numbers of connections. It is also possible to avoid
the $O(N^2)$ fan-out from each $m_{ij}$ across the arrays. For
example, an output tree could be used.

3.3 Network Complexity: Feasibility

With an analysis of the number of units and
connections required for the network, we have
characterized the space complexity of the network. The
question then becomes one of whether the complexity
allows a feasible implementation of the network.
Essentially, the network is $N^4$ in units and connections,
and $N^2$ in fan-out. Usually, the feasibility of
connectionist networks is determined by comparing the
space complexity required with what is available in the
human brain. Except for large $N$, by this comparison
the space complexity of this network is quite small.
This is particularly true if one recalls that the
algorithm is not designed for matching general graphs
of arbitrary size, but for matching structured
representations of objects. In creating a structured
representation for objects, sensibility and complexity
bounds (like the ones here) suggest the use of
hierarchies, which keeps $N$ quite low. (By quite low, I
mean around 10. Structured representations with
many more than 10 pieces would be quite amenable to
hierarchic representation.)

An alternative way to verify that the space
complexity requirements of the network are reasonable
is to attempt to actually implement it. As the
application experiment has shown [Cooper and
Hollbach 1987], this is indeed feasible. Even for
experiments with larger $N$, say $N$ about 10, it is quite
straightforward to implement a 10,000 unit network.
(Such a network implemented on a parallel processor
such as the Butterfly can even be expected to be
simulated in reasonable time [Fanty and Goddard
1985]).

4. Network Computation

4.1 Overview

With  the  background  and  network  configuration 
established,  it  is  now  possible  to  describe  in  detail  how 
the  network  performs  its  relaxation  and  correctly 
matches  the  2  graphs. 

Clearly,  the  2  arrays  of  matching  units  represent  all 
possible  individual  match  pairings,  as  well  as  all 
combinations.  Each  particular  match  for  a  given  graph 
element  (node  or  arc)  competes  with  all  other 
possibilities.  The  reinforcement  of  locally  consistent 
node  and  arc  matches  by  propagation  of  constraints 
between  the  2  networks  ultimately  determines  the 
final  matching  between  the  graphs. 

There  are  2  initialization  steps.  First,  the 
particular  case  of  GA  and  GB  to  be  computed  is 
instantiated  by  locking  the  potential  of  the  input  graph 
nodes.  Second,  the  matching  units  compute  from  the 
inputs  their  initial  state.  It  is  at  this  stage  that  match 
seeds  are  determined.  That  is,  local  uniquenesses  are 
found in both graphs and matched. Such bound
"winners" have high potential.

The  relaxation  itself  is  a  synchronous  computation. 
Each  time  step  consists  of  2  substeps.  The  first  substep 
is  the  constraint  propagation  phase.  In  this  substep, 
node  matching  units  send  messages  to  consistent  arc 
matching  units,  and  vice  versa.  The  second  substep  is 
the  winner-take-all  contest.  Units  which  win  the 
contest  signify  matches  between  particular  nodes  or 
particular  arcs.  Losers  signify  incorrect  matches;  these 
units  eventually  turn  off.  As  the  computation 
proceeds, the high ("matched") potential propagates
back and forth, growing larger and larger locally
consistent  matches  from  the  seed  matches.  The  match 
terminates  when  the  largest  possible  number  of  locally 
consistent  matches  have  been  determined. 

4.2  Internal  Unit  Structure 
4.2.1  Graph  Units 

Graph units, including the $p_i^A$, $p_j^B$, $r_{ef}^A$ and $r_{gh}^B$, are
very simple. They each have a set of discrete states
{off, on}. The potential corresponding to state off is 0
for the $p_i$ and $r_{ef}$, while the on potential is 1 for the $r_{ef}$
and equal to $l_i$ for the $p_i$. Unit output is equal to unit
potential.

These  units  are  locked  in  one  state  (and  potential) 
or  the  other  for  the  duration  of  the  computation.  Which 
units  are  active  depends  on  the  particular  instance  of 
GA  and  GB,  as  described  in  the  next  section. 

4.2.2  Matching  Units 

The units that represent particular matches are
more complex. Each unit $u_i$, where $u_i \in M_P$ or $u_i \in M_R$
(ie. the $u_i$ is a $m_{ij}$ or $n_{ef,gh}$), is described as:

$u_i = \langle S_i, X_i, D_i \rangle$

Each is a tuple consisting of

$S_i$ - the unit’s state, where $S_i \in \{off, contending, bound\}$

$X_i$ - the unit’s potential (equal to its output),
where $X_i \in \{0, 1, 2, \ldots, 20\}$

$D_i$ - a set of data inputs to the unit

The data inputs $D_i$ are conveniently divided into sites,
with a vector of inputs at each site. Thus:

$D_i = \langle D_{graph}, D_{WTA}, D_{cross-array} \rangle$

corresponding to the 3 types of connections attached to
each matching unit. Note that the same subscript is
used to designate the unit, state, and potential. Thus,
the state of a typical node matching unit $m_{ij}$ is
designated simply $S_{ij}$.

4.3  Problem  Instantiation 

The graph units $p_i^A$, $p_j^B$, $r_{ef}^A$, and $r_{gh}^B$ represent all
possible object and model graphs (of size bound $N$).
These units act as the input to the match computation.
They must first be activated correctly to represent the
particular candidate object graph and target model
graph for this computation. This process occurs as
follows.

For each node in the object graph $G^A$, the
corresponding node unit $p_i^A$ is activated. The potential
of the unit is locked at $l_i$, the label of $p_i$. Units $p_j^B$ are
activated likewise.

Suppose the input graphs have $N^A$ and $N^B$ nodes
respectively. Note that they may be unequal, and
likely less than $N$. Only $N^A$ nodes $p_i^A$ are activated.

For each arc present in the graphs, an arc unit $r_{ef}$ is
activated. Because the arcs are unlabelled, the
potential  of  these  units  is  locked  to  1.  The  mapping  of 
graph  arcs  to  arc  units  must  correspond  in  the  obvious 
way  to  the  mapping  of  nodes  to  node  units.  Note  that 
for  most  problem  instances  (except  for  those  with  very 
highly  connected  graphs),  many  fewer  arc  units  are 
activated  than  exist  in  the  general-use  network. 

4.4  Initialization  of  Matching  Units 

Match units receive data $D_{graph}$ from the graph
input units initialized as described above. However, it
is only in the very first time step that the potential of
the match units changes as a function of these graph
inputs $D_{graph}$.

Activation Function for Initialization of Node Match
Units

For each unit $m_{ij}$, 2 specific data inputs of $D_{graph}$ are
identified and designated as $d_{p_i}$ and $d_{p_j}$; ie. the
activation coming from $p_i^A$ and $p_j^B$. In pseudocode, the
initialization activation function is then:

if initialization step then
    if d_pi = d_pj then
        S_ij <- contending
        X_ij <- 1

Activation Function for Initialization of Arc Match
Units

For the arc matching units $n_{ef,gh}$, the data inputs $d_{r_{ef}}$
and $d_{r_{gh}}$ are special.

if initialization step then
    if d_ref = d_rgh then
        S_ef,gh <- contending

Note that while the activation functions for the units
are essentially the same, their effect is different. In the
node matching array, all units representing valid
matches are initialized to the contending state, because
the input represents the labels on the $p_i$ and $p_j$. In the
arc matching array, all possible arc matches
(corresponding to arcs which actually exist in this
problem instance) are activated.
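
The initialization step can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of the text: inactive graph units are taken to output 0 and are explicitly excluded from matching (so that two absent nodes or arcs do not produce a contending unit), and contending arc matching units are given potential 0, in line with the later remark that contending arc units output zero. Function and variable names are hypothetical.

```python
OFF, CONTENDING = "off", "contending"

def init_node_matches(labels_a, labels_b, n):
    """labels_a, labels_b: dict node index -> label for the active nodes.
    Returns (state, potential) dicts keyed by (i, j)."""
    state, potential = {}, {}
    for i in range(n):
        for j in range(n):
            d_pi = labels_a.get(i, 0)   # inactive graph units output 0
            d_pj = labels_b.get(j, 0)
            if d_pi != 0 and d_pi == d_pj:
                state[i, j], potential[i, j] = CONTENDING, 1
            else:
                state[i, j], potential[i, j] = OFF, 0
    return state, potential

def init_arc_matches(arcs_a, arcs_b, n):
    """arcs_a, arcs_b: sets of undirected arcs (e, f) with e < f.
    Returns (state, potential) dicts keyed by ((e, f), (g, h))."""
    all_arcs = [(e, f) for e in range(n) for f in range(e + 1, n)]
    state, potential = {}, {}
    for ef in all_arcs:
        for gh in all_arcs:
            d_ref = 1 if ef in arcs_a else 0
            d_rgh = 1 if gh in arcs_b else 0
            active = d_ref == 1 and d_ref == d_rgh
            state[ef, gh] = CONTENDING if active else OFF
            potential[ef, gh] = 0       # assumed: contending arc units output 0
    return state, potential
```

With labels `{0: 3, 1: 7}` for the candidate and `{0: 7, 1: 3}` for the model, only the units (0, 1) and (1, 0) start in the contending state, since those are the only valid matches.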

4.5  Unit  Function 

The  state  and  potential  of  a  unit  are  generally 
functions  of  the  previous  state,  previous  potential,  and 
current  input.  This  section  describes  the  unit  function  - 
the  piece  of  code  which  determines  what  the  state  and 
potential  of  a  unit  will  be,  given  the  previous  state  and 
potential,  and  the  current  inputs.  Connectionist  nets 
typically  have  large  numbers  of  units  whose  state  is 
controlled  by  the  same  function.  In  the  graph 
matching  network,  it  is  convenient  for  proof  purposes 
that  the  node  matching  units  and  arc  matching  units 
have  slightly  different  unit  functions. 

Both  unit  functions  access  a  time  state  variable  as 
well  as  previous  state  and  a  subset  of  current  inputs. 
Strictly  speaking,  this  is  just  an  approximation  to  what 
would  actually  be  required,  which  would  be  a  timing 
network  to  keep  the  units  in  synchrony,  with  the 
appropriate  substeps. 

The  unit  functions  also  use  a  distinguished 
potential  to  signify  their  being  in  the  state  bound  -  this 
potential  is  denoted  p,  and  is  equal  to  an  activation  of  5. 

Unit Function for Node Matching Units u_i, where u_i is a
node matching unit m_ij

if S_i = contending
    if winner-take-all substep then
        if X_i > max ( D_WTA ) then
            S_i <- bound
            X_i <- p    (p = 5)
        else
            if X_i < max ( D_WTA ) then
                S_i <- off
                X_i <- 0
            else
                S_i <- contending
                X_i <- 1
    else if propagation substep then
        X_i <- Σ d  where d ∈ D_cross-array

Unit Function for Arc Matching Units u_i, where u_i is an
arc matching unit n_efgh

if S_i = contending
    if winner-take-all substep then
        if X_i > max ( D_WTA ) then
            S_i <- bound
            X_i <- 1
        else
            if X_i < max ( D_WTA ) then
                S_i <- off
                X_i <- 0
            else
                S_i <- contending
                X_i <- 0
    else if propagation substep then
        X_i <- Σ d  where d ∈ D_cross-array
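
The two unit functions can be written as one parameterized routine. The following Python sketch is illustrative only: the names `wta_step`, `propagation_step`, `max_rival` (standing in for max(D_WTA)) and the `kind` argument are assumptions introduced here, and the timing network that sequences the substeps is not modeled.

```python
BOUND_POTENTIAL = 5  # the distinguished potential p from the text


def wta_step(state, x, max_rival, kind):
    """Apply one winner-take-all substep to a single unit.

    kind is "node" or "arc"; max_rival plays the role of max(D_WTA).
    Units that are already bound or off never change.
    """
    if state != "contending":
        return state, x
    if x > max_rival:
        # Node units announce "bound" with p, arc units with 1.
        return "bound", BOUND_POTENTIAL if kind == "node" else 1
    if x < max_rival:
        return "off", 0
    # Tie: keep contending; node units output 1, arc units output 0.
    return "contending", 1 if kind == "node" else 0


def propagation_step(state, x, cross_inputs):
    """Apply one propagation substep: sum the cross-array inputs."""
    if state != "contending":
        return state, x
    return state, sum(cross_inputs)
```

The only difference between the two unit types is the potential emitted in the bound and contending states, which is what the routine isolates in the `kind` argument.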

The unit functions are basically straightforward.
One complication arises because they are not identical.
The differences between them are subtle but useful at
the proving stage. The arc matching units in their
contending states output a potential of zero, and in
their bound state a potential of 1. The node matching
units receiving these messages can thus only
discriminate between bound and not bound states. The
node matching units send 0, 1, and p, on the other
hand, so the receiving arc matching units can
distinguish between off, contending, and bound
messages.

5. Network Properties
5.1  Convergence 

Theorem  1:  The  Network  Converges 

Proof: 

Clearly, only the matching units m_ij and n_efgh are
relevant. At the completion of each time step, each
such unit u_i has state S_i ∈ { off, contending, bound }.
Define a global goodness measure or total for the
network

    T = Σ_i T_i

as the sum of the individual unit goodnesses, where

    T_i =  1   if S_i = bound
    T_i =  0   if S_i = off
    T_i = -1   if S_i = contending

Now, units in state bound and state off cannot
change state, so the only possible transitions are from
contending to bound or from contending to off.
Therefore, if a unit u_i changes state, T must increase.
Therefore, T never decreases, and it strictly increases
whenever any unit changes state.

But the number of units is a finite constant and T is
bounded above by that number, so T can only increase
a finite number of times.

Therefore,  the  computation  must  eventually 
converge. 

QED 
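
As a small numerical illustration of the argument, the goodness measure can be computed directly from a list of unit states. The names `GOODNESS` and `total_goodness` are hypothetical; the values are those of T_i in the proof.

```python
# Per-unit goodness T_i as defined in the proof of Theorem 1.
GOODNESS = {"bound": 1, "off": 0, "contending": -1}


def total_goodness(states):
    """Return T, the sum of the individual unit goodnesses."""
    return sum(GOODNESS[s] for s in states)


# The only legal transitions leave the contending state, so each one raises T.
before = ["contending", "contending", "contending"]
after_bind = ["bound", "contending", "contending"]
after_off = ["bound", "off", "contending"]
trace = [total_goodness(s) for s in (before, after_bind, after_off)]
```

Here `trace` is strictly increasing, and T can never exceed the number of units, which is the bound the proof relies on.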

5.2  Correctness 

The  network  works  by  the  detection  and 
propagation  of  local  constraints.  Because  of  the  nature 
of  the  network’s  representation  of  matchings,  the  only 
local  constraint  available  is  consistency  between  valid 
(label  agreeing)  node  matches  and  arc  matches.  With 
this  in  mind,  it  is  necessary  to  characterize  formally 
just  what  kinds  of  matches  the  network  can  be  expected 
to  find,  and  then  to  prove  that  it  does  indeed  find  them. 
This  is  the  subject  of  the  next  few  sections. 

We  wish  the  network  to  determine  matchings  that 
are  locally  consistent.  The  very  nature  of  the 
algorithm  is  based  on  this  idea.  But  further,  we  want 
the  network  to  detect  the  matching  with  minimum 
ambiguity,  given  its  inherent  limitations  as  an 
algorithm.  Our  network  can  only  resolve  matching 
ambiguities  based  on  local  information.  Some 
ambiguities  are  only  resolvable  with  non-local 


information.  I  refer  to  a  matching  with  locally 
resolvable  minimum  ambiguity  as  a  matching  in  which 
all  graph  elements  that  are  matchable  with  local 
consistency  are  assigned  unique  matches. 

Hypothesis: 

The  network  determines  a  matching  which: 

a)  is  locally  consistent 

b)  has  locally  resolvable  minimum  ambiguity 
(all  nodes  and  arcs  which  are  matchable  with 
local  consistency  are  assigned  unique 
matches) 

Some  graph  elements  are  matchable  with  local 
consistency  by  being  locally  unique.  When  such  local 
uniquenesses  are  matched,  I  call  this  a  match  seed. 
Other  nodes  and  arcs  are  matchable  only  after 
consistency  has  propagated  to  them.  I  consider  first  the 
seed  matches.  All  these  terms  will  be  given  tight 
definitions  below. 

5.2.1  The  Network  Matches  Locally  Unique 
Nodes  and  Arcs 

Recall  during  the  following  discussion  that  the 
network’s  initial  processing  proceeds  as  follows: 

Time Step 0:

    Initialization of all m_ij and n_efgh (cf. Section 4.1)

Time Step 1: (a) Propagation Phase

    - matching units change potential as a function of
      cross-array connected unit activation

Time Step 1: (b) WTA Phase

    - units enter WTA competition and change
      potential and state depending on outcome

Time Step 2: (a) Propagation Phase
             (b) WTA Phase


Definition:

a node p_e^A has a unique unary (node) match m_eg to
node p_g^B iff

    (i)  l_e^A = l_g^B

  & (ii)  ∀ p_f^A [ (e ≠ f) ⊃ (l_e^A ≠ l_f^A) ]

  & (iii) ∀ p_h^B [ (g ≠ h) ⊃ (l_g^B ≠ l_h^B) ]

In the candidate object graph, there is some node with
a unique label (i.e. no other node in the candidate
object has that label). In the target model graph,
some node has the same label, and that label is
unique in the target model as well.
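
The definition can be checked sequentially in a few lines. The sketch below is an ordinary (non-connectionist) Python rendering of conditions (i) to (iii), intended only to make the definition concrete; the function name and the dictionary layout mapping node identifiers to labels are assumptions.

```python
from collections import Counter


def unique_unary_matches(labels_a, labels_b):
    """Return the set of (e, g) pairs that are unique unary matches.

    labels_a and labels_b map node identifiers to labels.
    """
    count_a = Counter(labels_a.values())
    count_b = Counter(labels_b.values())
    # Labels that occur exactly once in graph B, with the node carrying them.
    unique_in_b = {
        lab: node for node, lab in labels_b.items() if count_b[lab] == 1
    }
    return {
        (e, unique_in_b[lab])
        for e, lab in labels_a.items()
        if count_a[lab] == 1 and lab in unique_in_b
    }
```

For example, with labels `{"a1": "wing", "a2": "wing", "a3": "tail"}` in the candidate graph and `{"b1": "tail", "b2": "wing", "b3": "nose"}` in the model graph, only the pair `("a3", "b1")` qualifies, because "wing" is not unique in the candidate graph.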

Theorem  2:  The  network  detects  unique  unary  (node) 
matches. 

Proof: 

After time step 0, consider node matching unit m_ij.
If p_i^A = p_j^B is a unique unary label match, then S_ij =
contending and X_ij = 1. But also:

    ∀x, m_xj [ (x ≠ i) ⊃ (S_xj = off & X_xj = 0) ]

  & ∀y, m_iy [ (y ≠ j) ⊃ (S_iy = off & X_iy = 0) ]

In other words, m_ij is uniquely active in its row and
column.

In the first (propagation) substep of Time Step 1,
nothing is changed. So, in the second (WTA)
substep of Time Step 1:

    ∀x, m_xj [ (x ≠ i) ⊃ (X_ij > X_xj) ]

  & ∀y, m_iy [ (y ≠ j) ⊃ (X_ij > X_iy) ]

Therefore, at the end of Time Step 1, S_ij = bound.

QED 

Definition:

an arc r_ef^A = (p_e^A, p_f^A) has a unique binary (arc)
match n_efgh to arc r_gh^B = (p_g^B, p_h^B) iff

    (i)  l_e^A ≠ l_f^A  &  l_g^B ≠ l_h^B

         (the node labels on the arcs are not equal)

  & (ii) either [ (l_e^A = l_g^B) & (l_f^A = l_h^B) ]
         or     [ (l_e^A = l_h^B) & (l_f^A = l_g^B) ]

         (the label pairs on the ends of the arcs
         match in one direction or the other)

  & (iii) ∀ r_wx^A = (p_w^A, p_x^A)
          [ (ef ≠ wx) ⊃
            ( ¬(l_e = l_w & l_f = l_x)
            & ¬(l_e = l_x & l_f = l_w) ) ]

         (the label pair in the candidate object
         graph is unique)

  & (iv) ∀ r_yz^B = (p_y^B, p_z^B)
          [ (gh ≠ yz) ⊃
            ( ¬(l_g = l_y & l_h = l_z)
            & ¬(l_g = l_z & l_h = l_y) ) ]

         (the label pair in the target model graph is
         unique)
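
Conditions (i) to (iv) amount to saying that an unordered pair of distinct labels occurs on exactly one arc in each graph. The following Python sketch makes that reading explicit; it is a sequential illustration, not the network itself, and it assumes undirected arcs listed once each as node pairs. The function and helper names are hypothetical.

```python
from collections import Counter


def unique_binary_matches(labels_a, arcs_a, labels_b, arcs_b):
    """Return the set of (arc_a, arc_b) pairs that are unique binary matches."""

    def label_pair(labels, arc):
        u, v = arc
        # Unordered label pair; collapses to one element if the labels agree.
        return frozenset((labels[u], labels[v]))

    pairs_a = Counter(label_pair(labels_a, r) for r in arcs_a)
    pairs_b = Counter(label_pair(labels_b, r) for r in arcs_b)

    matches = set()
    for ra in arcs_a:
        key = label_pair(labels_a, ra)
        # (i) end labels differ; (iii) the pair is unique in graph A.
        if len(key) != 2 or pairs_a[key] != 1:
            continue
        for rb in arcs_b:
            # (ii) same pair in either direction; (iv) unique in graph B.
            if label_pair(labels_b, rb) == key and pairs_b[key] == 1:
                matches.add((ra, rb))
    return matches
```

Using a `frozenset` for the label pair handles both matching directions of condition (ii) at once, and its length tests condition (i).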

Theorem  3:  The  network  detects  unique  binary  (arc) 
matches. 

Proof: 

Consider each n_efgh after Time Step 1a, the
propagation phase. Recall that each n_efgh is
connected to exactly 4 m_ij: m_eg, m_eh, m_fg, m_fh. At the
beginning of time step 1, each of these m_ij had a
potential of 1 (if l_i = l_j) or 0. Therefore, the
potential X of any n_efgh is the number of contending
valid node-node matches which are consistent with
the arc-arc match n_efgh.

Now, suppose n_efgh is a unique binary match.
Recall from condition (i) of the definition of unique
binary match that the nodes on the ends of the
matched arcs may not be the same. Therefore, X_efgh
≠ 4. To have X_efgh = 3 is impossible. So, X_efgh
must be 0, 1, or 2. If n_efgh is a unique binary match,
then X_efgh = 2, and:

    either X_eg = 1 & X_fh = 1

    or     X_eh = 1 & X_fg = 1   (from condition (ii)
                                 of the definition).

But because the particular combination of pairs is
unique by definition, all the other elements of the
row and column containing n_efgh have X < 2 (from
condition (iii) and condition (iv) of the definition).
Therefore, the potential of n_efgh is larger than that
of all units in its row and column, so it wins the
WTA competition.

And, at the end of Time Step 1, S_efgh = bound.

QED 

Overall then, the network computes seed matches
from both unique unary and unique binary matches. But
what if the graphs contain no local uniquenesses to act
as match seeds? Such graphs are highly ambiguous
locally, containing many symmetries. The network
specified cannot determine unique matches in this
case, but will represent all the ambiguous matches,
with matching units stuck in state contending.

5.2.2 The Network Propagates Matches When
Possible

Definition:

an arc r_ef^A is matchable by propagation at time t if

    (i) ∃ r_gh^B [ S_efgh = contending ]

and either

    (ii) (S_eg = bound & S_fh = bound) |
         (S_eh = bound & S_fg = bound)

         (both ends are consistently matched, in
         one direction or the other)

or  (iii) 1 of 4 analogous situations is true. One of
         the 4 cases is described as follows (the
         others are obtained from the other 3
         consistent permutations of the e, f, g, and h
         parameters):

         (S_eg = bound) &
         (S_fh = contending) &
         ∀x [ (x ≠ h & ∃ r_gx^B) ⊃ (S_fx = off) ]   (*)

         (One end is matched, the other end has a
         consistent contending match, and the
         matched end has no symmetry with
         respect to propagation. In other words,
         p_g^B is connected to exactly 1 node p_h^B
         which is a consistent contending match
         with r_ef^A.)

In the following proofs (Theorems 4 and 5), suppose
the algorithm terminates after K time steps. In other
words, T^K = T^(K-1). In the following, I sometimes refer to
"matchable by propagation at time t" by simply
"matchable".

Theorem 4: If an arc r_ef^A is matchable by propagation
at time t and the network is still active (i.e. T^t ≠ T^(t-1)),
then the network assigns that arc a unique consistent
match at the end of time t + 1.

Proof: 

Case 1 (condition (ii) of definition above is true):
From the network topology, X_efgh = X_eg + X_eh +
X_fg + X_fh after the propagation substep
beginning time step t+1. From condition (ii),
X_efgh = 2p. But X_efgh is the largest potential in
its row:

    ∀ x,y [ ¬(x = g & y = h) ⊃ (X_efgh > X_efxy) ]

because X_efxy = X_ex + X_ey + X_fx + X_fy. But at
most one of these terms is p (if x = g or y = h) and
the rest must be zero, because p_e is bound to p_g
and p_f is bound to p_h. X_efgh is also the largest
potential in its column, by an analogous
argument.

Therefore, after the WTA substep, S_efgh =
bound.

Case 2 (condition (iii) of definition above is true):
Only the example case is proved. The other 3
analogous cases are identical, with
systematically varied subscripts.

Again X_efgh = X_eg + X_eh + X_fg + X_fh after the
propagation substep beginning time step t+1.
But X_eg = p, and therefore X_eh = X_fg = 0. Also,
X_fh = 1. Overall, X_efgh = p + 1. But this
potential is greater than all others in the row:

    ∀ x,y [ ¬(x = g & y = h) ⊃ (X_efgh > X_efxy) ]

because X_efxy = X_ex + X_ey + X_fx + X_fy and:

    (i) suppose x = g or y = h. Then 1 term is p,
        say X_eg for x = g, without loss of
        generality. Then X_ey is certainly zero.
        But y ≠ h, so both the other terms must
        be zero as well, from the implication (*) in
        condition (iii).

    (ii) suppose neither x = g nor y = h. Then
        X_ex = X_ey = 0. Because S_fh =
        contending, X_fx <= 1 and X_fy <=
        1. So X_efxy < p + 1.

Analogously, the potential X_efgh is greater than
all others in the column.

Therefore, by the completion of the WTA at the
end of time step t+1, S_efgh = bound.
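
The inequalities used in the two cases can be verified numerically for the distinguished potential p = 5. The sketch below only restates the bounds from the proof; the function name `theorem4_margins` is hypothetical, and the rival bounds are the worst cases argued above rather than values produced by a simulation.

```python
def theorem4_margins(p):
    """Return (winner, best_rival) potentials for Case 1 and Case 2."""
    # Case 1: both ends bound gives 2p; a rival shares at most one bound end.
    case1 = (2 * p, p)
    # Case 2, sub-case (i): one term is p and the others are zero.
    rival_shared_end = p
    # Case 2, sub-case (ii): two contending terms, each at most 1.
    rival_no_shared_end = 1 + 1
    case2 = (p + 1, max(rival_shared_end, rival_no_shared_end))
    return case1, case2


case1, case2 = theorem4_margins(5)
```

With p = 5 the winner beats its best rival in both cases (10 against 5, and 6 against 5). Sub-case (ii) needs 2 < p + 1, so the argument depends on p being larger than 1.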
Definition:

a node p_e^A with label l_e^A is matchable by
propagation at time t if

    (i) ∃ p_g^B [ S_eg = contending ]

        (p_e^A can match some node)

  & (ii) ∃ r_ef^A, r_gh^B [ S_efgh = bound &
         ( (S_fh = bound) | (l_e^A ≠ l_f^A) ) ]

        (at least 1 connected and consistent arc
        match exists, with either the other end
        matched, or dissimilar labels on the ends
        of each arc)

  & (iii) ∀ r_ex^A ∃ r_yz^B [ (x ≠ f & S_exyz = bound) ⊃
         (y = g & z ≠ h) | (z = g & y ≠ h) ]

        (for all the other matched arcs connected
        to e, their matches are consistent with the
        match of ef to gh)

  & (iv) ∀ r_gx^B ∃ r_yz^A [ (x ≠ h & S_gxyz = bound) ⊃
         (y = e & z ≠ f) | (z = e & y ≠ f) ]

        (for all the other matched arcs connected
        to g, their matches are consistent with the
        match of ef to gh)


Theorem 5: If a node p_e^A is matchable by propagation
at time t and the network is still active (i.e. T^t ≠ T^(t-1)),
then the network assigns that node a unique consistent
match at the end of time t + 1.

Proof:

If S_fh is bound, it must be that S_eh is off, as is S_fg.
Alternately, S_fh is contending and l_e^A ≠ l_f^A, so
again S_eh and S_fg are off.

Now,  after  propagation  =  Ia0  Xeo  RO  . 

Furthermore, X_eg is the largest potential in its row:

    ∀y [ (y ≠ g) ⊃ (X_eg > X_ey) ]

because:

    (a) S_efgh = bound, so X_eg receives an input of
        1 from n_efgh, and no other element in the
        same row as m_eg receives an input from n_efgh
        (m_eh would, but as we just saw, from
        condition (ii) of the definition, S_eh = off).

    (b) If some other m_ey receives an input, then
        some n_exyz is in state bound as well. (In
        other words, node e has another bound arc
        connected to it.) But, by the consistency
        condition (iii), if S_exyz = bound, then
        either y = g or z = g. In either case, X_eg
        always receives an input of 1 as well.

    Therefore, taken together, (a) and (b) dictate
    that X_eg is always at least 1 greater than all
    other X in the same row.

An analogous argument using condition (iv) of the
definition holds true for all units in the same
column as X_eg.

Therefore, X_eg is larger than all other units in its
row and column following the propagation substep.
Therefore, after the WTA substep, S_eg = bound.

QED

5.2.3  The  Network  Detects  a  Good  Match 

Finally,  we  must  prove  that  the  network  succeeds  in 
matching  all  the  matchable  graph  elements,  both 
nodes  and  arcs,  as  far  as  consistency  and  symmetry  will 
allow. 

Definition:

the set of graph elements matchable by propagation
is:

    ∪_(t >= 0) { nodes and arcs matchable by propagation(t) }

Definition:

the set of graph elements matchable with local
consistency includes

    (i) nodes with a unique unary match
  & (ii) arcs with a unique binary match
  & (iii) graph elements matchable by propagation

Theorem  6:  The  network  assigns  a  unique  consistent 
match  to  every  graph  element  that  is  matchable  with 
local  consistency. 

Proof:  (Contradiction) 

Assume the network converges in K steps, so T^K =
T^(K-1). Assume that there exists a matchable graph
element which is not assigned a match. By
Theorems 2 and 3, the network assigns matches for
the unary and binary local uniquenesses, so it
cannot be these. By Theorems 4 and 5, the network
matches those elements which are matchable by
propagation at time t < K (where the algorithm
terminates after K steps with T^K = T^(K-1)).
Therefore, if a matchable graph element exists that
is not matched, it must be matchable by
propagation for some t > K-1, say t = C. To reach
this state at t = C (which is different from the state
at t = K-1), the network must change state during
every timestep from t = K-1 to t = C. But T^K = T^(K-1),
and the network does not change state after t = K-1.
This is a contradiction.


QED 

5.3  Ramifications 
5.3.1  Time  Complexity 

There  are  some  interesting  ramifications  of  the 
correctness  characteristics  of  the  algorithm.  First,  it  is 
possible  to  show  that  the  network  converges  in  a 
number  of  time  steps  linear  in  the  number  of  graph 
elements  being  matched. 

Suppose we designate N_n^A and N_a^A as the number of
nodes and arcs respectively in graph A, and designate
N_n^B and N_a^B the same way. Now call SIZE^A = N_n^A +
N_a^A, and similarly for graph B. In the following proof, I
also  use  the  concept  of  a  propagation  path.  Consider 
some  node  which  is  matchable  by  propagation  at  time  t. 
Then  there  exists  one  particular  associated  seed  match 
from  which  the  match  propagated  to  the  node. 
Furthermore,  there  exists  some  path  from  that  seed 
match  to  the  node  along  which  the  match  propagated. 
Finally,  if  both  nodes  and  arcs  are  counted  in  the  path 
length  (but  each  is  counted  at  most  once,  if  the  path 
contains  a  cycle),  the  length  of  the  propagation  path 
from  the  associated  seed  match  to  the  node  must  equal 
exactly  t. 

Theorem 7: The network converges in K time steps,
where K <= min(SIZE^A, SIZE^B).

Proof:

Consider the smaller of the two graphs if they have
unequal sizes, and suppose without loss of
generality that it is graph A. The maximum length
of a propagation path in graph A is SIZE^A.
Therefore, the maximum time it can take for the
seed match to propagate to the worst case node match is
SIZE^A steps. Furthermore, propagation only occurs
when matches are being made. Therefore, while
graph B might have a longer path, propagation can
only occur for the shorter length of time.

One of the interesting aspects of this result is what it is
not. That is, one might expect that the longest
propagation path would be equal in length to the
longest path between any 2 nodes in the graph (the
diameter of the graph). This is not true, however,
because propagation depends not only upon the path
taken, but also upon when the propagation steps occur
and whether or not symmetries are present at a given
time. Sometimes a node becomes matchable by a very
roundabout propagation path.

5.3.2  Isomorphisms  and  Near  Matches 

Suppose N_n^A = N_n^B and N_a^A = N_a^B. If all the nodes
and arcs of G^A are matchable with local consistency, the
matching computed by the network is the unique
isomorphism.

But  suppose  the  sizes  are  equal,  but  more  than  one 
isomorphism  exists  between  the  graphs.  Then,  not  all 
the  nodes  and  arcs  of  the  graphs  are  matchable  with 
local  consistency.  But  what  does  this  mean?  Well,  for 
this  to  be  true,  there  must  exist  some  kind  of  local 
symmetry  in  the  graph,  beyond  which  the  constraint  of 
consistent  matching  cannot  propagate.  In  this 
situation,  all  (ambiguous)  isomorphic  matches  between 
the  graphs  are  represented  in  the  final  state  of  the 
network,  by  competing  matching  units  whose  state  has 
stabilized  at  contending.  Clearly,  if  necessary,  case 
analysis  could  be  used  to  further  disambiguate. 

Finally,  we  must  consider  what  happens  if  the 
graphs  are  completely  dissimilar.  That  is,  either  their 
sizes  differ,  or  their  structure  differs,  or  both.  It  is  still 
true  that  the  network  computes  all  seed  matches,  and 
propagates each match as far as consistently possible.
Thus,  all  elements  matchable  with  local  consistency 
are  assigned  matches.  The  final  matching  has  the 
largest  possible  locally  isomorphic  subgraphs  matched 
(with  each  containing  a  seed  match). 

6.  Discussion  and  Conclusions 

The central contribution of this paper has been to
describe a connectionist implementation of discrete
relaxation for labelled graph matching, and prove the
correctness of its operation. The general utility of
discrete relaxation is well known, and the abstract
form of the algorithm is well understood. But the
specifics of the connectionist implementation
illuminate some interesting issues.

First, with the connectionist implementation the
complexity of the algorithm can be made very precise.
The exact space and time requirements of the
algorithm have been specified. The explicit nature of
the implementation makes extending the algorithm in
parallel trivial, and its resource requirements remain
explicit and easy to compute. The theoretical speed
which one expects to obtain with massively parallel
implementations of algorithms is provably present.
With many implementations of relaxation (e.g.
[Kitchen and Rosenfeld 1979]) the time and space
requirements of the algorithm are not specified.

Second, it is interesting to observe how working
within the connectionist paradigm constrains a design.
The requirement that there be no interpreter means
each relevant value must be represented by a unit.
This suggests, in the context of the graph matching
problem, that representing matchings for ternary or
higher relations would become prohibitively expensive.
Furthermore, bandwidth limitations on the messages
the units can send each other (inspired by neural
communication bandwidth limitations) restrict the
character of what can be computed with unary and
binary constraints. For example, it would be difficult
to modify the algorithm so that more than one unary
predicate could be used (as [Kitchen and Rosenfeld
1979] do). Likewise, bandwidth limitations restrict the
potential utility of the unary predicate itself.
Interestingly, the overall resource requirements of the
algorithm suggest an upper bound on the number of
parts a structure representation can have before a more
efficient representation such as a hierarchy is required.
(Structure representations with more than a small
constant number of parts, say in the small double
digits, would become unwieldy.)

The paper contains somewhat stronger and more
specific results about what can be matched than are
typically found (cf. the definitions of matchable by
propagation for nodes and arcs). These results were
obtainable because the exact nature of the algorithm
was very tightly constrained (e.g., only unary and
binary constraints, with propagation occurring only
under very specific conditions).


The  particular  performance  characteristics  of  the 
algorithm  suit  the  task  problem  -  fast  indexing  from 
structure  descriptions  -  very  well.  Complete 
recognition  (with  verification)  requires  that  a  complete 
correspondence  between  object  and  model  be 
established,  and  this  algorithm  is  unable  to  do  so.  But 
it  can  be  used  to  quickly  reject  large  numbers  of  model 
candidates,  so  later  more  expensive  processes  can  be 
more  confidently  and  efficiently  applied  to  the 
remaining  candidates. 

Finally,  the  implementation  and  proof  provide  a 
basis  from  which  to  investigate  more  general  indexing 
problems.  In  future  work,  I  plan  to  investigate 
indexing  from  structure  with  uncertain  and  inexact 
structure  descriptions.  Connectionist  implementations 
seem to be natural hosts for such problems.

Abstract

We develop a model-driven approach for extracting generic
shapes from monoscopic and stereographic imagery. Generic
geometric constraints are used to obtain model candidates and
to suggest missing model components in the image data.
Competing scene parses are ranked using objective functions that
include the cost of encoding the scene information in terms of a
set of generic primitives and a measure of the geometric quality
of the parse. As examples, we formulate generic models for
buildings, roads, and vegetation clumps. The model definitions
include edge characteristics, two- and three-dimensional geometric
properties, region characteristics, and procedures for predicting
and verifying missing shape components. Results are shown
for a variety of aerial imagery.

1  Introduction 

A basic goal of computer vision research is to develop sound
theoretical approaches to the interpretation of information derived
from imaging sensors. Scene descriptions should be phrased in
terms of semantic models relevant to specific objectives; the
process of generating such an interpretation is typically referred to
as model-based vision [see, e.g., Brooks, 1981; Binford, 1982]. In
do[...] highly constrained, model-driven approaches
[... Bolles and Horaud, 1986; Shneier et al., 1980] have
[...] to supplement passive model-based methods by
predicting and searching for missing model parts in the image.

We suggest that for model-based vision to be effective in
less-constrained applications, such as the automatic delineation of
cartographic features in aerial imagery, generic models must play
a new and more dynamic role in all stages of the analysis. The
components of our "generic-model-driven" approach to scene
interpretation are:

•  Generic Models. We use generic models for tasks such
as automated cartography whose object classes cannot be
fully specified in advance [see, e.g., Fua and Hanson, 1985,
1987a,b,c]. Models that are too specific lack the breadth
and flexibility necessary to guide the interpretation process
for most real problems.

•  Active Use of Models for Discovery and Prediction.
Each model definition includes mechanisms for generating
and verifying missing-part hypotheses using image-based
computation. Therefore, the models themselves play an
essential role in finding appropriate candidate instances to
be evaluated with the objective functions.

*This research was supported in part by the Defense Advanced Research
Projects Agency under Contract No. MDA903-86-C-0084.


•  Image-Based  Objective  Functions.  For  each  model,  we 
define  an  objective  function  that  includes  measures  of  the 
quality  of  the  evidence  for  model  instances  in  the  observed 
image  data.  This  function  is  used  to  select  the  best  set  of 
compatible  model  candidates  to  describe  a  scene. 


In order to generate reasonable hypotheses for shape instances
in the presence of noise and missing structures, all procedures
in a feature-discovery system should be carried out with explicit
reference to the characteristics of a set of models. Model-based
vision cannot be effectively carried out using strictly passive
techniques, where extracted shape information is passed to an
interpreter that makes no further reference to the image data.
"Low-level" vision tasks cannot be pursued independently
without a clear concept of what modeling information will be
required of them; similarly, "high-level" tasks must refer directly
to the image to verify each semantic hypothesis.

Our  approach  employs  a  comprehensive  theoretical  approach 
to  model-driven  image  understanding  based  on  encoding  a  scene 
in  terms  of  a  model  language  and  evaluating  the  result  with 
objective functions. We also introduce practical implementation
techniques that enable the successful application of the
model-driven paradigm throughout the scene-parsing procedure.


2  Theoretical  Foundations 

The approach to model-based vision that we advocate seeks to
find scene interpretations, phrased in terms of a particular model
vocabulary, that optimize an objective function.

This theoretical approach would, in principle, require an
exhaustive evaluation of an enormous static set of alternate
interpretations. In practice, one would consider generating only a
limited set of model hypotheses using the output of syntactic
operators. But simply evaluating the objective function on the
output of a simple low-level operator will seldom give reasonable
results on complex images. The implementation must have
effective procedures for producing relevant model candidates for
the objective function to measure.

A candidate-generating procedure may produce many conflicting
proposals; the objective function selects an optimal set of
compatible hypotheses from among the competing alternatives.

To clarify these separate tasks, we have separated our discussion
into two major sections: the current section on theoretical
foundations, which emphasizes the abstract requirements of the
optimization approach regardless of how the model candidates
are acquired, and a subsequent section on implementation, which
illustrates appropriate techniques for generating optimal model
candidates while limiting undesirable combinatorics.

2.1  Generic Modeling

People can accurately classify instances of various object
categories even though a particular instance may have a unique
shape that they have never seen before. The introduction of
generic shape classes is necessary to automate this human ability.

The  generic  models  that  we  have  found  useful  for  analysis  of 
real  images  possess  the  following  properties: 

•  Strong edge geometry. Elementary edge or line data
extractable from an image must be related in some direct
and computable way to the object. Typical edge models
include such characteristics as long, straight segments,
uniform local curvature, and statistical signatures indicating
jaggedness. If we now define an optimal edge pixel to be a
pixel that is a maximum of the local image gradient in the
direction of this gradient, the edge score can be taken as
the ratio of the number of optimal edge pixels to the
number of pixels that are not optimal. This approach amounts
to counting the number of pixels that are step-edge pixels
according to a standard definition [Haralick, 1984; Canny,
1986]. The resulting score provides a good measure of how
well a set of edges fits the photometric data after the
geometric model has been imposed [Fua and Leclerc, 1988].

•  Strong  area  signature.  Areas  contained  within  a  generic 
object  should  be  characterizable  by  a  computable  signature, 
such  as  uniform  or  uniformly  changing  intensity  values  or 
textures.  For  real  objects,  such  as  parking  lots  with  cars 
or  roofs  with  chimneys,  anomalies  are  to  be  expected;  such 
anomalies  are  easily  located  and  discounted,  provided  they 
constitute  no  more  than  a  small  fraction  of  the  area.  Our 
requirements  can  be  met  by  the  following  fitting  procedure: 
We  find  the  peak  of  the  gray-level  histogram  of  the  pixels 
within  a  patch,  fit  a  plane  to  those  pixels  that  lie  within  the 
peak,  and  compute  a  histogram  of  the  deviations  of  those 
pixels  from  the  fit.  In  general,  the  deviation  histogram  will 
exhibit  a  main  peak  and  several  smaller  peaks  correspond¬ 
ing  to  the  anomalies.  The  area  score  is  then  taken  to  be  the 
average  number  of  bits  needed  to  encode  the  histogram  as 
a  sum  of  Gaussian  distributions. 

•  Optional  stereographic  signature.  When  stereographic 
information  is  available,  structures  in  one  image  can  be 
backprojected  to  matching  structures  in  the  other  image. 
The  quality  of  this  match  may  be  computed  using  a  combi¬ 
nation  of  the  technique  for  area  matching  just  described  and 
the measure used by Barnard [1988]. That is, we compute
the  histogram  of  the  intensity  differences  between  pixels  in 
the  patch  in  one  image  and  the  pixels  in  the  other  image  ob¬ 
tained  by  projection  of  the  hypothesized  three-dimensional 
position  of  the  surface.  The  stereo  match  score  is  then  the 
number  of  bits  needed  to  encode  the  histogram  of  intensity 
differences  as  a  sum  of  Gaussian  distributions. 

•  Predictability  of  incomplete  structures.  Edge  geome¬ 
try  and  area  signature  are  relatively  straightforward  to  eval¬ 
uate  if  we  are  given  a  perfect  model  candidate,  but  typical 
procedures for generating model hypotheses produce incomplete
structures. We therefore provide model-driven procedures
that search for incomplete components of the model
in the image.


Generic  modeling  is  thus  characterized  by  exploitation  of  ge¬ 
ometric  edge  constraints  and  photometric  area  properties,  com¬ 
bined  with  the  ability  to  fill  in  missing  model  components  in 
order  to  generate  reasonable  model  candidates  for  evaluation  by 
the  objective  function. 
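The edge score described in the first property above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the gradient magnitude `mag` and its components `gx`, `gy` are already available as nested lists, quantizes the gradient direction to the nearest 8-neighbour step, and the names `is_optimal_edge_pixel` and `edge_score` are hypothetical.

```python
import math


def is_optimal_edge_pixel(mag, gx, gy, r, c):
    """True if mag[r][c] is a local maximum along the gradient direction.

    Assumption: the gradient direction is quantized to the nearest
    8-neighbour step; the paper does not specify the discretization.
    """
    g = math.hypot(gx[r][c], gy[r][c])
    if g == 0.0:
        return False  # no gradient, so this cannot be a step-edge pixel
    dr = int(round(gy[r][c] / g))
    dc = int(round(gx[r][c] / g))
    rows, cols = len(mag), len(mag[0])
    for sign in (1, -1):
        rr, cc = r + sign * dr, c + sign * dc
        if 0 <= rr < rows and 0 <= cc < cols and mag[rr][cc] > mag[r][c]:
            return False
    return True


def edge_score(mag, gx, gy, boundary):
    """Ratio of optimal edge pixels to non-optimal pixels along a boundary."""
    n_opt = sum(1 for r, c in boundary
                if is_optimal_edge_pixel(mag, gx, gy, r, c))
    n_non = len(boundary) - n_opt
    return float("inf") if n_non == 0 else n_opt / n_non
```

A boundary made only of gradient maxima yields an infinite ratio in this sketch; a practical system would presumably cap or rescale the score.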

The  models  that  we  have  implemented  to  date — buildings, 
roads,  and  trees — are  relatively  simple.  They  are  defined  as 
circular  graphs  of  edges  that  are  geometrically  related  and  can 
be  completed  to  form  closed  contours  conforming  to  the  model 
geometry. The model definition therefore consists of the following
components: (1) edge definition and geometric signature,
(2) specification of an area's photometric signature and filtering
thresholds, (3) definition of geometric relationships among
edges,  and  (4)  procedures  that  produce  closed  contours  from  a 
circular  graph  of  related  but  disjoint  edges.  For  roads  and  build¬ 
ings,  all  four  elements  are  present;  for  trees,  which  are  much 
less  constrained,  the  strong  geometric  restrictions  are  absent. 
There  is  no  intrinsic  obstacle  to  defining  much  more  complex 
hierarchies  of  objects,  but  we  have  not  yet  implemented  such 
definitions. 

2.2  Model-Based  Evidence  Measures 

The  objective  function  is  a  combined  measure  of  the  deviation  of 
a  model  candidate  from  the  image-based  evidence  for  the  model 
and  the  geometric  quality  of  the  candidate. 

We  choose  the  image-based  measure  F  to  be  the  effectiveness 
of  a  model  instance,  which  we  define  as  the  difference  between  the 
number  of  bits  needed  to  encode  the  photometry  of  a  scene  patch 
without  the  model  versus  the  number  required  with  the  model. 
In  cases  where  the  cost  of  encoding  the  model  itself  can  be  ne¬ 
glected,  effectiveness  is  the  number  of  bits  of  information  that 
are  saved  by  describing  the  image  information  in  terms  of  the 
model rather than without it; effectiveness thus measures goodness
of fit.

The  geometric  quality  measure  G  is  represented  by  a  cost  that 
we  will  argue  is  related  to  the  probability  of  the  occurrence  of  a 
model  instance's  geometric  characteristics. 

For a set of hypothesized model instances denoted by $M = \{m\}$,
the overall score of a scene parse is

$S = F - G$,  (1)

where $F = \sum_{m \in M} F_m$ is the total effectiveness of the scene
representation and $G = \sum_{m \in M} G_m$ is the total geometric cost of
the parse. These definitions implicitly assume statistical independence
of the individual model instances. For complex scenes,
we should in principle express F and G in a way that accounts for
the  fact  that  complex  models  are  far  from  statistically  indepen¬ 
dent;  buildings  are  correlated  with  roads,  bridges  with  rivers, 
and  so  forth.  For  our  present  purposes,  the  independence  as¬ 
sumption  is  a  good  approximation  and  makes  the  optimization 
problem  computationally  tractable. 

Photometric Evidence Sources. The photometric effectiveness
term $F_m$ for each model candidate contains terms corresponding
to the set $\mathcal{E} = \{e\}$ of evidence sources being considered:

$F_m = \sum_{e \in \mathcal{E}} F_{m,e}$.  (2)

We assume that we have chosen sources of evidence whose
interdependence can be neglected. We will show below how
the measures $F_{m,e}$ can be computed in practice in a noiseless
information-theoretic coding framework.

We  reiterate  below  the  photometric  evidence  computations 
applied  to  each  model  candidate  in  our  implementation: 

•  Area Intensity Characteristics. Find the peak of the
gray-level histogram of the pixels within a patch, fit a plane
to those pixels that lie within the peak, and compute a histogram
of the deviations of those pixels from the fit. The evidential
measure is the number of bits needed to encode the
histogram of deviations as a sum of Gaussian distributions
[Leclerc, 1988], i.e., $\sum_i n_i \log_2 \sigma_i + n_i \log_2(n_i/N) + cN$,
where $\sigma_i$ and $n_i$ are the width and number of pixels in
each Gaussian, $N$ is the total number of pixels, and
$c = 0.5(\log_2(e\pi) + 1)$.

•  Edge  Quality  and  Geometry.  Select  edge  segments 
matching  the  model  geometry  and  compute  the  edge  mea¬ 
sure  from  the  ratio  of  the  number  of  optimal  edge  pixels  to 
the  number  of  pixels  that  are  not  optimal. 

•  Stereographic  Matching  Characteristics.  Compute 
the number of bits needed to encode the histogram of intensity
differences of stereo-matched pixels as a sum of Gaussian
distributions.
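The area-fitting steps listed above (histogram peak, plane fit, deviation histogram) can be sketched as follows. This is a rough illustration under stated assumptions rather than the authors' code: the bin width and peak half-width are hypothetical parameters, the plane is fitted by ordinary least squares via Cramer's rule, and deviations are reported for every pixel in the patch so that anomalies appear as secondary peaks. The final encoding of the deviation histogram as a sum of Gaussians is not attempted here.

```python
from collections import Counter


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points):
    """Least-squares fit z = a*x + b*y + c to (x, y, z) triples."""
    n = float(len(points))
    sx = sum(x for x, _, _ in points)
    sy = sum(y for _, y, _ in points)
    sz = sum(z for _, _, z in points)
    sxx = sum(x * x for x, _, _ in points)
    syy = sum(y * y for _, y, _ in points)
    sxy = sum(x * y for x, y, _ in points)
    sxz = sum(x * z for x, _, z in points)
    syz = sum(y * z for _, y, z in points)
    lhs = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(lhs)
    if d == 0:
        raise ValueError("degenerate pixel layout: cannot fit a plane")
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in lhs]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)


def peak_pixels(pixels, bin_width=4, half_width=8):
    """Pixels (x, y, gray) whose gray level lies near the histogram mode.

    bin_width and half_width are illustrative choices, not source values.
    """
    hist = Counter(int(g // bin_width) for _, _, g in pixels)
    mode_bin = max(hist, key=hist.get)
    center = (mode_bin + 0.5) * bin_width
    return [p for p in pixels if abs(p[2] - center) <= half_width]


def deviation_histogram(pixels, bin_width=4, half_width=8):
    """Histogram of rounded deviations of all patch pixels from the plane."""
    a, b, c = fit_plane(peak_pixels(pixels, bin_width, half_width))
    return Counter(round(g - (a * x + b * y + c)) for x, y, g in pixels)
```

For a flat roof patch with a few chimney pixels, the histogram returned by `deviation_histogram` would show a dominant bin near zero and a small secondary bin at the chimney offset, which is the structure the text describes.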

Additional possible sources of evidence include texture measures,
range characteristics, and continuity of motion in a motion sequence.


Geometric  Evidence  Sources.  The  geometric  measure  G 
enforces  geometric  constraints  on  the  model  candidates.  When 
we  are  dealing  with  a  limited  number  of  shape  candidates  whose 
very  discovery  was  prompted  by  a  prejudice  toward  a  particular 
model  shape,  all  of  the  shape  candidates  are  guaranteed  to  obey 
quite  strictly  the  geometric  requirements  of  the  corresponding 
model.  Thus  it  is  often  sufficient  to  invoke  only  the  image- 
based  photometric  evidence  measures  described  in  the  previous 
paragraph  and  G  may  be  neglected.  However,  in  scenes  with 
substantial  complexity  and  ambiguous  photometry,  it  becomes 
necessary  to  include  G  in  Eq.  (1)  to  differentiate  parses  on  the 
basis  of  geometric  quality.  Model-based  constraint  systems  such 
as  ACRONYM  [Brooks,  1981]  include  elaborate  capabilities  for 
dealing with such geometrical questions; similar facilities could
be  incorporated  into  our  framework  by  suitably  tailoring  G. 

Typical examples of purely geometric measures of model quality
that have been found useful are rectilinearity, jaggedness,
compactness of the area of a delineated model element, and the
absence of narrow appendages or links. Simple versions of such
measures can be effected using thresholds on such quantities as
relative angles between model edge components and the compactness
of a delineated area.

2.2.1  Estimating  the  Photometric  Effectiveness. 

Our explicit choices for the fundamental effectiveness measures
in Eq. (2) are motivated by a heuristic derivation based on
information-encoding arguments [see, e.g., Leclerc, 1988]. We
assume first, for simplicity, that we are working with a two-dimensional
model consisting only of area and edge properties.
This type of model is in fact the one we have explored most
extensively. Next, we show how stereographic properties can be
handled. Further properties can be added using techniques similar
to those employed for the stereographic evidence measure.

We begin by considering a patch of contiguous pixels in the
image. Since we are going to consider both the area and the edge
properties, we need to distinguish edges from other pixels in a
meaningful way; to this end, we define an "edge" as a maximum
of the local image gradient measured in the direction of the
image gradient at the edge pixel [Canny, 1986]. Thus we need
to define the following properties of an image patch:

$A$ = Area of the patch in pixels

$L$ = Length of patch boundary in pixels

$n$ = Number of boundary pixels that are maxima of the local image gradient

$\bar{n} = L - n$ = Number of boundary pixels that are not local maxima of the local gradient.

If we treat the patch as though it were a random background,
with no available model-based interpretation, we are led to define

$b$ = Probability that a random pixel in the image is a maximum of the local image gradient

$k$ = Average number of bits needed to encode the intensity of a random pixel in the image.

The minimum number of bits required to encode a body of
information is the negative logarithm of its probability [Rissanen,
1987]. Assuming now that pixel values are uncorrelated,
a Huffman-encoding scheme is appropriate, and the
minimum number of bits needed to encode the edge properties
of all boundary pixels in a random background patch is
$-n \log_2 b - \bar{n} \log_2(1 - b)$. The total encoding cost in bits of the
model-free patch is then

$B = kA - n \log_2 b - \bar{n} \log_2(1 - b)$.  (3)

Although both $b$ and $k$ can be estimated by direct computation
in a single image, one technically should use an ensemble of
images; in practice we choose these quantities to be heuristically
determined constants.

Now  let  us  introduce  a  model  consisting  of  a  photometric  area 
signature  and  an  edge  geometry  requirement.  The  cost  of  en¬ 
coding  the  patch  using  the  model  is 

$C = mA - n \log_2 p - \bar{n} \log_2(1 - p)$,  (4)

where

$m$ = Average number of bits needed to encode the deviation from the model of the pixel values in the patch

$p$ = Probability that a boundary pixel of an actual object is a maximum of the local image gradient.

Note that $p$ requires prior knowledge of the nature of each image
with respect to the goal of interpreting that image in terms
of a particular model set. Therefore, it is seldom precisely computable
and, like $b$ and $k$, will be chosen in practice as a heuristic
parameter.

The  effectiveness  of  a  model  description  now  becomes 


$F = B - C = (k - m)A + n \log_2 \frac{p}{b} + \bar{n} \log_2 \frac{1 - p}{1 - b}$,  (5)

where the coefficients $m$, $A$, $n$, and $\bar{n}$ are evaluated for each
patch.

The measure F has a number of qualitative features that correspond
well with what we need for scene evaluation. For example,
we see that the value increases when the fit to the area model
improves, since k is fixed and m approaches zero for constant
intensities, and that the value also increases when the number
of local maxima of the image gradient that are present in the
region boundary increases relative to the number of boundary
pixels that are not local maxima.

Equation  (5)  is  thus  a  realization  of  the  abstract  objective 
function  Eq.  (2)  that  we  use  to  evaluate  two-dimensional  scene 
parses. 
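The encoding costs of Eqs. (3) and (4) and the resulting effectiveness of Eq. (5) translate directly into code. The sketch below is only an illustration: the function names are hypothetical, and the default values k = 7.0, p = 0.9, and b = 0.01 are the heuristic constants reported in Section 4.

```python
import math


def model_free_cost(A, n, n_bar, k=7.0, b=0.01):
    """Eq. (3): bits needed to encode the patch as random background."""
    return k * A - n * math.log2(b) - n_bar * math.log2(1.0 - b)


def model_cost(A, n, n_bar, m, p=0.9):
    """Eq. (4): bits needed to encode the patch using the model."""
    return m * A - n * math.log2(p) - n_bar * math.log2(1.0 - p)


def effectiveness(A, n, n_bar, m, k=7.0, p=0.9, b=0.01):
    """Eq. (5): F = B - C, the bits saved by describing the patch with the model."""
    return ((k - m) * A
            + n * math.log2(p / b)
            + n_bar * math.log2((1.0 - p) / (1.0 - b)))
```

With these definitions, lowering `m` (a better area fit) or moving boundary pixels from `n_bar` to `n` (more gradient maxima on the boundary) both raise the returned value, which matches the qualitative behavior described in the text.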

In  order  to  account  for  the  presence  of  stereographic  informa¬ 
tion,  we  add  to  the  objective  function  the  term 

$F_{\text{stereo}} = B_s - C_s = (k_s - m_s)A$,  (6)

where $m_s$ is the number of bits needed to encode the histogram
of the intensity differences between corresponding pixels in the
candidate stereo match, and $k_s$ is the average number of bits
required to describe the histogram of differences between two
random patches. Although a reasonable value for $k_s$ could be
computed directly for each stereo pair, we will treat $k_s$ as a
heuristic constant just as we treat $k$.

2.2.2  Maximum  Likelihood  Interpretation 

Minimal encoding and maximum likelihood are closely related
[see, e.g., Cheeseman, 1983]. We now show how, under certain
simplifying assumptions, the heuristic cost function that we optimize
can be reduced to the logarithm of a maximum likelihood
measure. To see this relationship, we consider the probability
$P$ that, given the evidence $E = \{e_i\}$, parsing the scene in terms
of a particular set of model instances $M = \{m_0, m_i : i = 1, \ldots, n\}$
is in fact correct. Here the indices $i$ refer to model instances
such as delineated objects, and $m_0$ represents the background
model, i.e., the set of all pixels in the scene that do not belong
to delineated objects and for which no evidence is obtained.

Iterating the decomposition formula $p(a, b | c) = p(a | b, c) p(b | c)$,
we may express $P$ as

$P = p(m_0, m_1, \ldots, m_n | e_1, \ldots, e_n)$
$= p(m_0 | e_1, \ldots, e_n) \prod_i p(m_i | m_0, \ldots, m_{i-1}, e_1, \ldots, e_n)$,  (7)

where $p(m_i | m_0, \ldots, m_{i-1}, e_1, \ldots, e_n)$ is the probability that a
hypothesized model instance $m_i$ is correct given both the available
evidence and the presence of a subset of other models.

We  now  make  two  assumptions  about  the  nature  of  relation¬ 
ships  between  the  models  and  the  evidence  that  lead  to  a  useful 
approximation  for  the  scene-parse  probability: 

•  Independence of each model from the other evidence.
If the body of evidence $e_i$ refers to the i-th model,
we assume that $m_i$ is affected by its own evidence $e_i$ and
the presence of the other model instances $m_{j \neq i}$, but is
independent of $e_{j \neq i}$. Then

$p(m_0 | e_1, \ldots, e_n) = p(m_0)$

$p(m_i | m_0, \ldots, m_{i-1}, e_1, \ldots, e_n) = p(m_i | m_0, \ldots, m_{i-1}, e_i) = \frac{p(m_0, \ldots, m_{i-1}, e_i | m_i)}{p(m_0, \ldots, m_{i-1}, e_i)} p(m_i)$.

•  Independence of the evidence from the other models.
We also assume that the evidence relative to model
instance $m_i$ is independent of the presence of other model
instances, so that

$p(m_0, \ldots, m_{i-1}, e_i) = p(m_0, \ldots, m_{i-1}) p(e_i)$

$p(m_0, \ldots, m_{i-1}, e_i | m_i) = p(m_0, \ldots, m_{i-1} | m_i) p(e_i | m_i)$.

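Using these two assumptions, we have

$p(m_i | m_0, \ldots, m_{i-1}, e_1, \ldots, e_n) = \frac{p(m_0, \ldots, m_{i-1} | m_i) p(e_i | m_i)}{p(m_0, \ldots, m_{i-1}) p(e_i)} p(m_i) = p(m_i | m_0, \ldots, m_{i-1}) \frac{p(e_i | m_i)}{p(e_i)}$,

where we used Bayes' rule to combine terms in the last line.
Thus we finally have

$P = p(m_0) \prod_i p(m_i | m_0, \ldots, m_{i-1}) \frac{p(e_i | m_i)}{p(e_i)}$
$= p(m_0) \prod_i p(m_i | m_0, \ldots, m_{i-1}) \prod_i \frac{p(e_i | m_i)}{p(e_i)}$
$= p(m_0, \ldots, m_n) \prod_i \frac{p(e_i | m_i)}{p(e_i)}$.  (8)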


We now write

$\log_2 P = F - G$,  (9)

where

$F = \log_2 \prod_i \frac{p(e_i | m_i)}{p(e_i)} = \sum_i (\log_2 p(e_i | m_i) - \log_2 p(e_i))$

$G = -\log_2 p(m_0, \ldots, m_n)$,

and argue that F and G are identifiable with the total effectiveness
and geometric cost defined in Eq. (1).

If $C_i$ denotes the number of bits of information needed to
encode the observed deviations of the evidence source $e_i$ from
the model $m_i$, and $B_i$ denotes the number of bits to encode the
evidence in the absence of a model, we have

$C_i = -\log_2 p(e_i | m_i)$

$B_i = -\log_2 p(e_i)$.

Thus the effectiveness measures defined earlier can be written

$F_i = B_i - C_i$  (10)

and

$F = \sum_i F_i$  (11)

is now identifiable as the total effectiveness.

Next, in order to identify $G$ with the cost in Eq. (1), we assume
that we may express the joint probability term in Eq. (8) as

$p(m_0, \ldots, m_n) = k \prod_i 2^{-\epsilon_i}$, so that $G = \sum_i \epsilon_i - \log_2 k$.  (12)

If we take $\epsilon_i$ to represent the geometric deviations of each model
instance from an ideal shape, then $G$ becomes the overall geometric
cost measure to within a constant.

Assuming the form (12) for the joint probability expresses the
fact that parses in terms of instances with good geometric properties
are more probable than others. In the absence of any
geometric information, we would set $\epsilon_i = 0$, thereby effectively
making a maximal entropy assumption in which one represents
ignorance by assuming all prior probabilities to be equal. However,
it is obvious that Eq. (12) is an oversimplification because
complex scene structures typically contain mutually supporting
evidence, e.g., houses on streets imply the existence of driveways.
In principle one should capture these dependencies instead of
ignoring them.

Thus, while the cost functions used in this work are not rigorously
equal  to  maximum  likelihood  measures,  there  is  a  very  close 
relationship  under  the  stated  assumptions. 
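The correspondence between bit counts and likelihoods in this subsection can be checked numerically. The following sketch assumes the two independence assumptions stated above already hold and uses toy probabilities chosen as powers of two; the function names are hypothetical and the numbers are not taken from the paper.

```python
import math


def bits(prob):
    """Minimum code length, in bits, of an event with the given probability."""
    return -math.log2(prob)


def total_effectiveness(likelihoods, evidence_probs):
    """F = sum_i (B_i - C_i) with B_i = -log2 p(e_i), C_i = -log2 p(e_i | m_i)."""
    return sum(bits(p_e) - bits(p_e_given_m)
               for p_e_given_m, p_e in zip(likelihoods, evidence_probs))


def log2_parse_probability(likelihoods, evidence_probs, joint_model_prob):
    """log2 P = F - G, where G = -log2 p(m_0, ..., m_n)."""
    F = total_effectiveness(likelihoods, evidence_probs)
    G = bits(joint_model_prob)
    return F - G
```

For example, with likelihoods `[0.5, 0.25]`, evidence probabilities `[0.125, 0.0625]`, and a joint model probability of `0.03125`, the direct product of the joint probability with the two likelihood ratios and the `F - G` form give the same logarithm.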

3  Implementation 

Recall  that  our  basic  goal  is  to  find  parses  of  a  scene  that  are 
optimal  with  respect  to  the  objective  function  defined  in  Eq.  (1) 
for  a  particular  set  of  models. 

Given  unlimited  computing  resources,  we  could  parse  a  scene 
in  terms  of  a  particular  model  vocabulary  and  generate  labels 
for  the  image  simply  by  evaluating  the  objective  function  for 
every  possible  combination  of  pixels  and  keeping  those  with  the 
best  scores.  Since  this  is  not  feasible  in  practice,  we  devote  this 
section  to  presenting  some  practical  techniques  for  achieving  our 
objectives  without  being  overcome  by  combinatorics. 

The  fundamental  concept  is  to  build  a  set  of  competing  and 
possibly  conflicting  model  candidates  using  a  dynamic  proce¬ 
dure,  and  then  to  select  the  set  of  compatible  hypotheses  that 
maximizes  the  objective  function.  This  is  in  fact  an  optimization 
procedure  with  the  following  components: 

•  State  the  parsing  goal  by  defining  the  model  vocabulary  to 
be  used  in  the  scene  description. 

•  Build locally optimal model primitives and retain competing
candidates that exceed a threshold effectiveness score.
Recurse  in  a  hierarchy  of  model  primitives  if  appropriate. 

•  Select  a  subset  of  compatible  high-level  model  candidates 
that  maximizes  the  objective  function. 

We  first  discuss  the  domains  of  application  we  have  chosen, 
along  with  the  specific  models  we  have  used.  We  then  present 
the  basic  steps  in  the  current  implementation,  along  with  some 
examples  of  how  they  work  in  practice. 

Domains.  We  consider  two  principal  application  domains: 

•  Delineation  of  Two-Dimensional  Object  Models  in 
Aerial  Imagery.  Two-dimensional  feature  delineation 
is required for the construction of maps from aerial imagery.
Even when human analysts use stereographic imagery
for this task, the cartographic output is always
two-dimensional; three-dimensional content is indicated by
labels  giving  heights  of  building  complexes,  radio  tower 
heights,  and  elevation  contour  height.  Two-dimensional 
modeling  and  delineation  is  therefore  of  substantial  prac¬ 
tical  importance  in  its  own  right. 


•  Discovery  of  Three-Dimensional  Object  Models  in 
Multiple  Images.  Full  three-dimensional  data  are  re¬ 
quired  for  applications  such  as  flight  simulation,  simulation 
of nonoptical sensors, and autonomous ground-level navigation.
The information present on typical cartographic
products  is  no  longer  sufficient  in  these  cases,  and  com¬ 
plex  three-dimensional  information,  including  perhaps  even 
texture  and  material-composition  maps  for  object  surfaces, 
may  be  required.  We  thus  consider  in  addition  the  construc¬ 
tion  of  full  three-dimensional  models  from  multiple,  possibly 
but  not  necessarily  stereographic,  images.  Other  modeling 
information  that  might  be  used  for  three-dimensional  recon¬ 
struction  includes  shadow  analysis  and  shape-from-shading 
models. 

Model  Vocabulary.  The  model  vocabularies  that  we  have  ex¬ 
amined  in  detail  include  buildings,  roads,  and  vegetation  clumps 
in  aerial  imagery.  The  building  model  assumes  that  roof  com¬ 
ponents  are  planes  having  rectilinear  edges  and  internal  areas 
with  planar  grey  level  intensities,  with  no  more  than  30%  of  the 
pixels being anomalous. The road model is similar, except that
it assumes smoothly curved edges that are parallel and a constant
width apart. The tree model assumes jagged edges that have
strong  contrast  with  the  background  and  uniformly  textured  in¬ 
terior  areas.  All  models  can  in  principle  be  optimized  in  stereo, 
so  that  buildings,  elevated  highways,  and  trees  could  be  distin¬ 
guished  from  ground-level  background  structures  with  similar 
geometric  signatures. 

Parsing  Procedure.  The  steps  in  the  parsing  procedure  that 
we  have  implemented  are  the  following: 

•  Generate  Edge  Cues  from  a  Set  of  Segmentations. 

Our  system  uses  as  initial  shape  cues  segmentation  re¬ 
gion  boundaries  that  exhibit  the  appropriate  geometry,  e.g., 
straight  segments  for  buildings  and  elliptical  segments  for 
roads.  While  we  could,  in  theory,  start  with  the  output 
of  an  edge  detector,  starting  with  the  output  of  a  region 
segmenter such as the Laws segmenter [Laws, 1984, 1988] has
proven  to  be  very  effective.  The  segmentation  regions  are 
optimal  in  the  sense  defined  by  Leclerc  [1988]  in  terms  of 
their  area  signature  and  are  therefore  good  cues  for  the  lo¬ 
cation  of  actual  objects.  Since  region  boundaries  may  not 
match  exactly  the  photometric  edges,  the  system  extracts 
the  edges  by  finding  breakpoints  in  the  boundary,  fitting 
curves  with  the  appropriate  geometry  between  those  break¬ 
points  and  optimizing  their  location  using  the  technique 
described  in  Fua  and  Leclerc  [1988].  The  resulting  edges 
are scored by computing the ratio of pixels that are maxima
of the local gradient to those that are not, and only the
best edges are retained.

To achieve image independence, instead of using a single
segmentation we start with a set of segmentations constructed
using a family of increasingly permissive parameters,
then extract edges from all the segmentation regions
at all levels. This is necessary because no single parameter
setting can be expected to handle all target objects in
one image, much less in multiple images. In Figure 1, we
show a typical image containing a complex suburban house;
Figure 2 illustrates typical problems such as the absence of
relevant shape information and the presence of excess irrelevant
information encountered when one attempts to use
any single parameter setting for segmentation systems or
Canny edge maps.

•  Construct  Compatible  Binary  Edge  Relationships. 

Pairs  of  edges  that  satisfy  the  geometric  model  constraints 
are  assigned  binary  edge  relationships.  At  first,  in  order 
to  avoid  combinatorial  explosion,  only  edges  that  belong 
to  the  same  segmentation  regions  are  considered.  In  the 
next  phase,  the  same  physical  edge  may  be  identified  in 
the  boundaries  of  several  regions  belonging  to  different  seg¬ 
mentations  in  the  set;  we  match  such  edges  across  the  set  of 
segmentations.  Using  these  matched  edges  and  the  already- 
constructed  binary  relationships,  the  system  attempts  to 
generate  new  relationships  by  considering  transitive  hierar¬ 
chical  relationships,  as  shown  in  Figure  3.  Since  different 
parts  of  the  same  object  may  appear  at  different  segmenta¬ 
tion  levels,  this  last  step  turns  out  to  be  important  because 
the  system  can  unify  shape  cues  that  have  been  discovered 
in  qualitatively  different  segmentations. 

Finally,  a  patch  in  the  image  is  associated  with  each  de¬ 
tected  binary  structure.  The  photometric  properties  of  this 
area  are  computed  and  only  structures  with  a  good  photo¬ 
metric  score  are  retained. 

•  Construct  Cycles.  Circular  lists  of  binary  relationships 
satisfying  the  geometric  model  constraints  are  then  con¬ 
structed.  To  avoid  combinatorial  explosion,  only  binary 
relationships  with  similar  photometric  characteristics  are 
used to generate cycles. In Figure 4, we show the system’s
symbolic  representation  of  these  cycles  as  circular  graphs  of 
geometric  relationships.  The  actual  edges  that  form  these 
cycles  are  shown  in  Figure  5(a). 

•  Build  Enclosures. 

Model-based completion procedures are now invoked to close
the cycle of edges, thus generating closed contours. To complete
rectilinear cycles, we use a modified version of the F*
algorithm [Fischler et al., 1981] to find paths in the lattice
of most probable parallel and perpendicular edges; to
complete curvilinear cycles, we use the "energy minimizing
curve" technique described in Fua and Leclerc [1988]. Two
such closed contours are shown in Figure 5(b). This strategy
is extremely effective in finding such things as faint
road edges that an edge detector may miss, but that are
paired with a strong edge on the opposite side of the road.
A sequence of points that has been optimized as an isolated
edge may move nontrivially when it is optimized as a
subcomponent of a more complex model.

•  Select  Enclosures. 

Candidate enclosures are then scored using the objective
function given in Eq. (1). Enclosures with poor area characteristics
or poor edge photometry are discarded. In general,
for complex scenes, the system builds a set of overlapping
and therefore conflicting enclosures that encompass the
candidate model instances mentioned earlier. The system
chooses a subset of non-overlapping enclosures that maximizes
the objective function defined in Eq. (1).


In  applications  where  the  system  uses  more  than  one  model 
type,  it  first  computes  the  enclosures  using  each  model 
alone,  then  finds  the  best  combined  subset  of  enclosures 
in  terms  of  the  objective  function. 
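The final selection step, choosing a conflict-free subset of enclosures with the highest total score, can be illustrated with an exhaustive search. The paper does not say which search strategy the system uses, so the sketch below is an assumption: it enumerates all subsets, which is practical only for small candidate sets, and the candidate names and scores in the usage note are invented for illustration.

```python
from itertools import combinations


def best_compatible_subset(scores, conflicts):
    """Return (subset, total) maximizing the summed score without conflicts.

    scores:    dict mapping a candidate name to its score S = F - G.
    conflicts: set of frozenset pairs naming overlapping candidates.
    Exhaustive enumeration is an illustrative choice, not the source's method.
    """
    names = sorted(scores)
    best_set, best_total = (), 0.0  # the empty parse scores zero
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            overlapping = any(frozenset(pair) in conflicts
                              for pair in combinations(subset, 2))
            if overlapping:
                continue
            total = sum(scores[name] for name in subset)
            if total > best_total:
                best_set, best_total = subset, total
    return set(best_set), best_total
```

Given scores `{"house_a": 120.0, "house_b": 90.0, "road": 150.0, "yard": -10.0}` and a single conflict between `house_a` and `road`, the search keeps `road` and `house_b` and drops the negatively scored `yard`, so the conflict is resolved in favor of the candidate with the stronger evidence.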

4  Results 

In  this  section,  we  show  results  of  running  our  system  in  a  fully 
automated  fashion  on  four  complex  images  with  very  different 
photometry.  The  implementation  described  above  requires  a  few 
arbitrary  thresholds  such  as  the  minimal  area  or  edge  score  of  an 
acceptable object, as well as the heuristic parameters k, p, and b
of  Eq.  (5);  however,  we  emphasize  that  all  results  given  here  are 
obtained  by  running  the  system  with  a  single  set  of  thresholds 
and  parameters.  The  precise  form  of  the  objective  function  used
G  =  (n  +  n)  ^log2  -  -  log2  )  ’

The  choice  of  functional  form  for  G  gives  preference  to  com¬ 
pact  candidates  by  penalizing  long  perimeters,  therefore  select¬ 
ing  parses  in  which  the  objects  are  not  broken  into  unwarranted 
subparts.  The  choice  of  coefficients  ensures  that  high-quality 
edges are penalized less than poor-quality ones. We fix the values
of the parameters to be $k = 7.0$, $p = 0.9$, $b = 0.01$, and
impose the thresholds $m < 7.0$, $n/(n + \bar{n}) > 0.7$. Parsing results
are relatively insensitive to changes in the parameters. The
value 7.0 is interpretable as the number of uncorrelated bits in
one of our typical digital gray-scale images; 0.9 is approximately
the probability that a boundary pixel of an object is an optimal
edge and 0.01 is approximately the ratio of optimal edge
pixels to non-optimal pixels in the images we have been working
with.

4.1  Buildings  and  Roads:  Rectilinear  and  Curvi¬ 
linear  Networks 

To  illustrate  the  behavior  of  the  system  on  monoscopic  data,  we 
have  chosen  two  images  that  are  especially  challenging  because 
of shape complexity and faint edge photometry. When the first
image, Figure 1, is analyzed using the building and the road
models separately, we find that the highest-scoring set of
non-overlapping enclosures consists of those shown in Figures 6(a)
and (b). Merging the two models, we find the overall parse in
terms of the best candidates for buildings and roads shown in
Figure 6(c). We see that conflicts are resolved in favor of the
model with the strongest evidence. Some false model candidates
remain because, in the absence of stereoscopic or shadow information,
rectilinear yard areas are difficult to distinguish from
legitimate buildings. In this particular case, they can be eliminated
by retaining only the enclosures that have the highest
score, as shown in Figure 6(d). The incorporation of stereo into
the analysis is discussed below.

The next image, Figure 7, also contains difficult-to-parse cultural
structures with a different physical scale. Figures 8(a) and
8(b) show the building and road candidates found by applying
the models separately; all of the apparent buildings but one were
found. When the two models are merged, we find the results in
Figure 8(c); by retaining only the best enclosures, we get the
final parse shown in Figure 8(d).
4.2 A Group Picture: Houses, Roads, and Trees


Vegetation  clumps,  typically  small  groups  of  trees,  are  comple¬ 
mentary  to  the  regular  cultural-object  models  we  have  described 
so  far.  Their  boundaries  are  typically  jagged  and  irregular.  Any 
compact  object  that  contrasts  strongly  with  its  background  con¬ 
text  and  has  no  components  that  are  road-like  or  building-like 
could  be  a  candidate  for  vegetation. 

We  generate  candidate  vegetation  clumps  by  optimizing  the 
locations  of  segmentation  region  boundaries.  To  compensate  for 
the lack of geometric constraints, we use two additional thresholds
for scenes requiring vegetation discovery: For any candidate
instance, we require a minimum acceptable compactness,
as measured by (A/P^2), where A is the area and P the perimeter,
and a minimum contrast, as measured by the ratio of the
average edge strength along the boundary to the average edge
strength inside the region.
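
The two vegetation thresholds can be illustrated with a minimal Python sketch. It assumes that a candidate region is available as a polygon (a list of vertices) together with two lists of edge-strength samples. The function names and the threshold values `min_compactness` and `min_contrast` are hypothetical, since no values are stated in the text.

```python
import math

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    s = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0

def polygon_perimeter(vertices):
    """Sum of the side lengths of the closed polygon."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))

def compactness(vertices):
    """A / P^2; at most 1 / (4*pi), reached by a circle, small for thin shapes."""
    p = polygon_perimeter(vertices)
    return polygon_area(vertices) / (p * p)

def contrast(boundary_strengths, interior_strengths):
    """Ratio of mean edge strength on the boundary to mean edge strength inside.

    Raises ZeroDivisionError if the interior mean is exactly zero.
    """
    mean_b = sum(boundary_strengths) / len(boundary_strengths)
    mean_i = sum(interior_strengths) / len(interior_strengths)
    return mean_b / mean_i

def accept_vegetation(vertices, boundary_strengths, interior_strengths,
                      min_compactness=0.03, min_contrast=2.0):
    # Both threshold values are illustrative assumptions only.
    return (compactness(vertices) >= min_compactness
            and contrast(boundary_strengths, interior_strengths) >= min_contrast)
```

Under these assumed thresholds a unit square with a strong boundary passes, while a long thin sliver of the same area is rejected because its perimeter is large relative to its area.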

We now examine the monoscopic image in Figure 9 and parse
it with respect to all three models: buildings, roads, and trees.
The separate results for each model are shown in Figure 9(a)
and (b); the main road is lost because of its low contrast.

The  tree  model  can  be  extended  to  the  stereographic  case 
by  requiring  jaggedness  in  the  relative  elevation  of  successive 
points  on  the  candidate  vegetation  curve;  this  characteristic  dis¬ 
tinguishes  vegetation  from  shadows  of  vegetation  and  other  ir¬ 
regular  ground  markings. 


4.3  Stereographic  Buildings 

When  stereographic  or  multiple  imagery  is  available,  we  gener¬ 
alize  the  model  to  account  for  elevation  of  the  rectilinear  compo¬ 
nents  above  the  ground  or  above  enclosing  building  components. 
Both  horizontal  and  slanted  roof  planes  are  permissible,  and  the 
three-dimensional  generic  model  consists  of  planar  surfaces,  each 
of  which  has  at  least  one  straight  edge  in  common  with  another 
surface. The additional term used to evaluate stereo match quality
in the objective function is


^’stereo  ”  ("^  " 


(Id) 


where m, was introduced in Eq. (0).

In Figure 10, we show a stereo pair containing a complex set
of buildings. Running the monoscopic procedure on one image,
we get the rectilinear outlines shown in Figure 11. Next, we use
the camera models to back-project these outlines into the second
image and use the stereographic-matching objective function to
optimize the height of the building sections. We thus obtain
the three-dimensional structures that exhibit the best agreement
with the model constraints and the evidence in the stereographic
image data. In Figure 12, we selectively display the structures
that have significant elevation with respect to the ground.

Since we know the three-dimensional position of each rectilinear
contour, we can now generate three-dimensional objects
having the observed two-dimensional upper surface. When we
supply the automatically computed surface contours of the large
building in Figure 10 to the SRI cartographic modeling system
[Hanson et al., 1987; Hanson and Quam, 1988], we can generate
synthetic three-dimensional views of the scene such as the one
shown in Figure 13.


5 Conclusions

In this work, we have proposed a scene-modeling approach based
on  a  hypothesis-generation  technique  that  uses  active  optimiza¬ 
tion  to  discover  generic  model  components  in  digital  images. 

The  objective  function  that  is  optimized  measures  the  close¬ 
ness  of  each  model  hypothesis  to  an  ideal  model  instance  and  to 
the  evidence  supporting  its  realization  in  the  image  data. 

We  apply  this  objective  function  in  practice  by  generating 
model candidates from segmentation-based cues and by optimizing
the geometry of the model candidates with respect to
the measure. The model-instance discovery process optimizes
the  model  interpretation  by  referring  directly  to  the  image  data 
at  all  levels  of  the  scene  parsing  process. 

We  have  used  generic  geometric  models  that  include  a  combi¬ 
nation  of  edge,  photometric,  and  stereographic  characteristics  to 
delineate  several  classes  of  objects  in  aerial  images.  Invocation 
of a particular model or set of models effectively sets a goal for
the  image  interpretation  process,  giving  us  the  freedom  to  parse 
the  same  image  with  different  objectives. 

The  system's  effectiveness  derives  from  the  use  of  the  objec¬ 
tive  function  philosophy  to  optimize  the  entire  interpretation 
process  with  respect  to  the  evidence  in  the  image.  In  future 
work,  we  plan  to  extend  the  range  and  complexity  of  the  models 
supported,  to  integrate  geometric  measures  more  closely  with 
the  photometric  effectiveness  in  our  overall  framework,  and  to 
include  high-level  interdependencies  among  model  instances  to 
support  complex  cartographic  analysis  requirements. 
Figure 2: (a) A Laws segmentation with an undersegmented
roof. (b) Oversegmentation resulting from a different
parameter choice. (c) A Canny edge map showing
a scarcity of object edges. (d) Excessive irrelevant
edges resulting from a different edge threshold
choice.


Figure  3:  Construction  of  binary  relationships  among  elemen¬ 
tary  edge  structures  across  a  set  of  segmentations. 


Figure 4: The symbolic relationships among the edge components
of two typical cycles. L, C, and P stand for
linear, corner, and parallel binary relationships,
respectively.
Texture Models and Image Measures
for Segmentation
Abstract

The  problem  of  texture  segmentation  is  to  identify  image 
curves  that  separate  different  textures.  The  segmentation 
problem  has  two  phases:  detecting  differences  between  tex¬ 
tures,  followed  by  localizing  the  edges  thus  detected.  This 
paper  focuses  on  texture  discrimination,  which  is  based 
on  differences  in  texture  measures  between  two  image  re¬ 
gions.  We  treat  texture  not  merely  as  image  pattern,  but 
as  the  projection  of  scene  structure;  two  image  textures  dif¬ 
fer  when  the  scene  textures  of  which  they  are  projections 
differ. 

We propose a model for a large class of scene textures
and examine their images under projection. A set of image
texture measures is proposed that allows the reliable
discrimination  of  physically  different  textures.  Textures  are 
characterized  by  distributions  of  shape,  position  and  color  of 
image  substructures;  methods  for  estimating  these  distribu¬ 
tions  are  discussed.  A  forced-choice  method  is  proposed  for 
evaluating  the  ability  of  texture  measures  to  discriminate 
textures.  The  proposed  texture  measures  have  been  imple¬ 
mented,  and  empirical  tests  show  that  the  measures  can 
discriminate  a  large  set  of  natural  textures  with  90-100% 
accuracy,  compared  with  about  40%  accuracy  for  another 
popular  set  of  texture  measures. 

1  Introduction 

Texture  segmentation  seeks  to  identify  image  curves  on  ei¬ 
ther  side  of  which  textures  differ.  Such  image  texture  dis¬ 
continuities  usually  correspond  to  some  kind  of  discontinu¬ 
ity  in  the  scene1,  such  as  an  object  boundary,  occlusion, 
material  change,  etc.  A  texture  segmentation  algorithm 
bases  its  edge  decisions  on  the  results  of  a  set  of  texture 
measures  applied  to  the  image;  it  detects  edges  where  the 
measures change abruptly. One goal of texture segmentation
research is to find a set of texture measures that serves
to discriminate between a large proportion of different textures.
Many such texture measures have been proposed in
the literature, some ad hoc, some motivated by human
psychophysical data. Most of these have treated image texture
as simply pattern in the image, failing to recognize that we

1 We use the convention that scene refers to a three-dimensional
collection of objects, of which the image is the two-dimensional
projection.


seek to discriminate textures that differ in the scene rather
than  in  the  image.  To  this  end,  it  is  necessary  to  model 
texture  in  the  scene,  examine  its  projection  to  the  image, 
and  measure  image  features  that  serve  to  discriminate  dif¬ 
ferent  scene  textures.  This  paper  proposes  such  a  texture 
model,  derives  the  associated  image  features,  and  demon¬ 
strates  their  utility  in  distinguishing  real  textures. 

To  begin  with,  it  is  necessary  to  summarize  the  main 
points of an earlier paper [27]. The remainder of this introduction
discusses the notion of a texture generation process
as a way to model texture, and examines the meaning
of  texture  segmentation.  Section  2  proposes  a  method  for 
quantifying  the  ability  of  a  set  of  image  texture  measures 
to  discriminate  textures.  Section  3  examines  the  nature  of 
textures  in  the  3D  scene  and  proposes  a  model.  The  fol¬ 
lowing  section  examines  the  image  of  such  textures  under 
projection, and derives image measures that serve to distinguish
different scene textures. Section 5 presents results
that show that the proposed image measures can reliably
distinguish a variety of natural textures, and compares these
results  with  those  obtained  with  other  texture  measures. 
The  final  section  recapitulates  and  identifies  directions  for 
further  research. 

1.1  Texture  Processes  and  Measures 

The  problem  of  texture  segmentation  may  be  broken  into 
two  phases:  detecting  differences  between  image  textures, 
and  localizing  the  texture  edges  thus  detected.  This  pa¬ 
per  focuses  on  the  former  aspect,  texture  discrimination; 
the goal is to determine whether two image textures correspond
to different scene textures. Textures may differ in
two ways: physically and perceptually. Consider an analogy
to color perception²: two colors are physically identical
when their color spectrum is identical, and perceptually
identical  when  they  cannot  be  distinguished  by  a  human 
observer.  The  colors  we  distinguish  are  determined  by  our 
set  of  color  receptors;  physically  different  colors  that  stimu¬ 
late  the  color  receptors  identically  are  perceived  as  identical, 
and  are  called  metamers. 

To  make  the  analogy  to  texture,  we  must  be  able  to  de¬ 
scribe  a  texture  physically.  For  this  purpose,  we  use  the 
parameters  of  the  process  that  generated  the  texture.  A 
texture  generation  process  is  the  physical  source  from  which 

2 This analogy will be useful at other points in this paper.
a scene texture arises; the process is governed by a set of
parameters that determine the nature of the textures created.
We  distinguish  between  a  texture  generation  process  and  its 
instances  in  the  scene  (which  are  three-dimensional),  and 
the  images  of  such  texture  instances.  Texture  generation 
processes  are  often  stochastic.  As  an  example,  a  process 
that  generates  a  three-dimensional  collection  of  randomly- 
placed  dots  is  a  texture  process  (its  parameters  include  the 
dot  density  and  placement  distribution)  and  any  particular 
collection  of  such  dots  is  an  instance  of  the  texture.  In  this 
paper,  the  term  texture  refers  to  texture  processes  and  tex¬ 
ture  instances  in  the  scene,  i.e.,  three-dimensional  textures, 
while  image  texture  refers  to  the  2D  projection  of  an  in¬ 
stance  of  a  scene  texture.  We  will  often  refer  to  an  instance 
of  a  scene  texture  simply  as  a  scene  texture,  when  there  can 
be  no  confusion. 

To  continue  the  analogy  with  color,  two  texture  processes 
are  physically  identical  when  their  parameters  are  identical. 
Two  image  textures  are  physically  identical  when  they  are 
the  images  of  identical  scene  textures  (assuming  identical 
imaging  geometry).  Two  image  textures  are  perceptually 
identical  when  a  human  observer  cannot  preattentively  dis¬ 
criminate  them;  these  depend  on  the  texture  features  mea¬ 
sured  by  our  visual  system.  Physically  different  textures 
may be perceptually indistinguishable (metamers). For example,
the textures in Fig. 1a differ physically and perceptually,
while those in Fig. 1b are metamers, since human
vision  measures  no  texture  features  that  allow  immediate 
discrimination.  The  textures  that  a  vision  system  can  dis¬ 
tinguish  depend  on  the  texture  features  it  measures.  A 
vision  system  that  measures  a  finite  set  of  features  can¬ 
not  distinguish  every  possible  pair  of  physically  different 
textures  just  as  a  color  vision  system  that  measures  a  fi¬ 
nite  number  of  color  samples  cannot  distinguish  between 
all physically different colors [4]. However, carefully chosen
color sensors allow discrimination of most physically different
colors [13]; similarly, carefully chosen texture measures
allow us to distinguish most physically different textures as
well.  This  paper  focuses  on  that  choice  of  texture  measures. 

1.2  Texture  Edge  Detection 

When texture processes are stochastic (as is usually the
case)  it  is  not  possible  in  general  to  decide  without  error  if 
two  textures  are  different  or  not.  As  a  consequence,  there 
is  no  unique  way  to  segment  a  textured  image.  A  simple 
thought  experiment  should  clarify  this:  imagine  an  image 
in  which  each  pixel  is  chosen  randomly  to  be  white  or  black, 
with  equal  probability.  Such  an  image  contains  no  under¬ 
lying  edge,  and  on  average  looks  ‘random.’  However,  it  is 
possible  (albeit  unlikely)  that  all  the  pixels  on  one  half  of 
the  image  are  black  while  all  of  those  on  the  other  half  are 
white.  It  is  incorrect  to  infer  that  a  discontinuity  exists 
in the underlying distribution. Instead, one may compute
the  statistical  significance  of  the  observed  edge,  that  is,  the 
probability  of  the  observed  difference,  under  the  null  hy¬ 
pothesis  that  the  textures  are  identical.  Such  tests  do  not 
lead  to  binary  decisions  (edge,  no  edge)  unless  a  signifi¬ 
cance  level  is  chosen.  In  the  random-pixel  example,  the 


significance  of  an  edge  between  all-black  and  all-white  re¬ 
gions  would  be  very  high,  and  our  confidence  in  an  edge  at 
that  location  is  high.  In  general,  we  can  at  most  estimate 
relevant  image  parameters  on  either  side  of  the  edge  and 
compare  them;  our  certainty  that  there  is  an  edge  depends 
on the size of the difference in estimated parameters. When
the estimated probability of the observed difference under the
null hypothesis is low, the observed edge is
likely to be the result of a true scene discontinuity.
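
The random-pixel thought experiment can be made concrete with a short Python sketch. It is only an illustration: it assumes binary pixels (0 for white, 1 for black) and uses a pooled two-proportion z-test with a normal approximation as a stand-in for whatever significance test is actually applied. The function name `edge_significance` is hypothetical. A small returned probability corresponds to a highly significant edge.

```python
import math

def edge_significance(left_pixels, right_pixels):
    """Two-sided probability of the observed difference under the null
    hypothesis that both sides share one black-pixel probability.

    Pixels are 0 (white) or 1 (black). The normal approximation is rough
    for very small samples.
    """
    n1, n2 = len(left_pixels), len(right_pixels)
    p1 = sum(left_pixels) / n1
    p2 = sum(right_pixels) / n2
    pooled = (sum(left_pixels) + sum(right_pixels)) / (n1 + n2)
    var = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    if var == 0.0:
        # Every pixel on both sides has the same value: no evidence of an edge.
        return 1.0
    z = (p1 - p2) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For an all-black region next to an all-white region of 50 pixels each, the returned probability is vanishingly small, so an edge hypothesis is strongly supported even though the generating process contains no edge. Two regions with the same proportion of black pixels give a probability of 1.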

This  notion  of  edge  detection  leads  to  testing  for  edges  at 
many  positions,  orientations,  and  scales  in  the  image.  As  a 
measure  of  the  difference  between  two  image  textures,  we 
shall  use  the  statistical  significance  of  the  difference  between 
estimated  distributions.  (Note  that  one  determiner  of  the 
significance  of  the  edge  is  the  size,  or  scale,  of  the  region 
over  which  statistics  are  collected.)  This  difference  measure 
will  be  used  in  the  localization  stage  to  isolate  edges  at 
places  where  the  difference  is  locally  maximum.  Statistical 
significance  gives  a  precise  meaning  to  the  ‘output  of  the 
edge  detector’  and  provides  a  solid  foundation  for  further 
interpretation. 

2  Evaluating  Texture  Measures 

There  are  no  commonly  accepted  criteria  for  judging  the 
effectiveness  of  a  texture  segmentation  algorithm;  this  lack 
makes  it  difficult  to  compare  texture  measures  with  re¬ 
spect  to  their  ability  to  distinguish  textures.  Researchers 
typically  demonstrate  a  segmentation  program  by  showing 
that  it  finds  the  major  texture  edges  in  a  natural  image 
[18,19,23,30],  or  by  constructing  a  mosaic  of  textures  and 
showing  that  the  algorithm  correctly  segments  the  mosaic 
[5,6,20]. These tests confound the ability of a set of texture
measures  to  discriminate  textures  with  the  ability  of  the  al¬ 
gorithm  to  localize  edges.  This  paper  is  concerned  primarily 
with  texture  discrimination;  again,  one  goal  of  this  research 
is  to  determine  a  set  of  texture  measures  that  serves  to  dis¬ 
criminate  most  physically  different  textures.  This  section 
discusses  ways  to  evaluate  texture  measures.  The  texture 
model  and  associated  image  features  proposed  here  are  later 
evaluated  according  to  this  method. 

Suppose we are to evaluate a set of image texture measures
M with respect to its ability to discriminate textures.
Let $\mathcal{T}$ be a texture process and $T$ be the image of an
instance of $\mathcal{T}$. Denote by $M(T)$ the result of applying the
measures M to $T$; $M(T)$ is a vector (possibly of length 1) that
serves to describe $T$. We will adopt what is known in psychology as
the forced-choice paradigm for evaluating texture measures.
The system is presented with images in which it is known
that there is a texture edge at one of $N$ positions; it must
determine where the edge is located. The location decision
corresponds to the position where the maximum difference
between texture measures occurs. The fraction correct is
accumulated over a number of trials. When M cannot
discriminate $\mathcal{T}_1$ and $\mathcal{T}_2$, performance is near the chance level
$1/N$. This evaluation method has the advantage that it
does not use thresholds, nor does it require edges to be
localized. For use below, let us denote the performance
(fraction correct) of M on $\mathcal{T}_1$ and $\mathcal{T}_2$ by $P_M(\mathcal{T}_1, \mathcal{T}_2)$.
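
The forced-choice procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the results reported later: regions are plain lists of pixel values, the difference between measure vectors is taken to be Euclidean distance (the paper uses statistical significance), and ties are broken at random. The names `locate_edge`, `fraction_correct`, `mean_intensity`, and `constant` are hypothetical.

```python
import math
import random

def locate_edge(measure, candidates, rng):
    """Return the index of the candidate position with the largest difference.

    candidates[i] is a pair (region_a, region_b) of pixel lists on either
    side of candidate position i. Ties are broken at random, so a measure
    that carries no information performs at the chance level of one in
    len(candidates).
    """
    diffs = [math.dist(measure(a), measure(b)) for a, b in candidates]
    best = max(diffs)
    return rng.choice([i for i, d in enumerate(diffs) if d == best])

def fraction_correct(measure, trials, seed=0):
    """trials is a list of (candidates, true_index) pairs."""
    rng = random.Random(seed)
    hits = sum(locate_edge(measure, cands, rng) == true_idx
               for cands, true_idx in trials)
    return hits / len(trials)

def mean_intensity(region):
    """A one-component measure vector: the mean gray level."""
    return [sum(region) / len(region)]

def constant(region):
    """An uninformative measure, used to illustrate chance performance."""
    return [0.0]
```

With three candidate positions, a measure that separates the two textures scores 1.0, while the constant measure scores close to 1/3. No threshold appears anywhere in the procedure, which is the property the text emphasizes.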

Figure 1: Examples of preattentively distinguishable and indistinguishable textures. (a) Difference in orientation of line
segments causes segmentation. (b) No texture segmentation, although upon inspection the triangles on top and bottom are
inverted. This physical texture difference is not immediately perceived.


It is useful to quantify the performance of a set of measures
M over a population of textures that are likely to be
encountered.  Suppose  we  restrict  our  attention  to  such  a 
class  C  of  textures.  Each  member  of  C  is  a  texture  gener¬ 
ation  process  defined  by  some  set  of  parameters  that  con¬ 
trols  the  process.  Describing  the  distribution  of  textures  in 
C  requires  describing  the  distribution  of  texture  parame¬ 
ters.  This  distribution  may  be  very  complex  and  difficult 
to  describe,  but  it  does  exist.  We  measure  the  performance 
of  M  on  C  by  repeatedly  choosing  two  texture  processes 
$\mathcal{T}_1$ and $\mathcal{T}_2$ from C, according to the distribution, and
determining the performance $P_M(\mathcal{T}_1, \mathcal{T}_2)$ of M. The average
performance  measure  provides  an  indication  of  the  ability 
of  M  to  distinguish  two  textures  from  the  class  C.  When 
C  is  the  class  of  textures  likely  to  be  encountered  in  the 
natural  environment,  the  distribution  of  textures  in  C  is 
very complex, and it is necessary to determine the performance
of M empirically, that is, by testing it on examples
of  textures.  One  way  to  do  this  is  to  construct  test  images 
from  textures  drawn  from  a  large  sample  such  as  Brodatz’s 
album  [3].  Since  the  texture  process  itself  is  not  available  to 
generate  textures,  we  must  be  satisfied  with  using  different 
portions  of  a  single  instance.  This  is  the  approach  taken 
here. 


3 A Model for Scene Texture

We now begin the discussion of models for scene texture.
Previous models for texture have been confined primarily
to models of image texture (e.g., [1,8,11,17]). These image
models are of limited interest to us here. Shape-from-texture
methods usually make the model for scene texture
more explicit, since they must acknowledge the fact that image
texture is the projection of scene texture. These models
usually assume that textured objects have smooth surfaces
covered with uniform-density, identical texture elements [2];
elongated texture elements are assumed to be isotropically
distributed [16,31]. Such models are unrealistic in natural
scenes; here we discuss a more plausible model for scene
texture. This paper provides only a brief description of the
model; more details may be found in [28].

3.1 Examples

The scope of the scene texture model includes the following
kinds of objects whose images are textured; this list is meant
to be suggestive, not exhaustive.

• Homogeneous, connected objects that appear identical
everywhere, such as a volume of air, water or
metal. These are instances of null textures, discussed
below.

• Collections of physically separate and well-defined
parts, normally held together by gravity, such as a
pile of rocks, sand, etc.

• Collections of physically separate parts distributed in
a background medium; the background is in general
another texture. Examples: a cluster of balloons, a
flock of birds, rocks in snow.

• Variegated, connected objects, such as wood, granite,
marble, dirt, etc. Such an object can be viewed
as a collection of substructures whose boundaries are
poorly defined; its surface may be rough or smooth.

In  addition,  we  allow  for  texture  arising  from  object  sur¬ 
faces,  including: 

•  Surface-marking  texture,  i.e.,  zero-thickness  mark¬ 
ings  on  a  smooth  surface.  This  is  the  class  of  scene 
textures  to  which  most  previous  models  (e.g.,  [2,16]) 
have  been  restricted.  Such  surfaces  have  been  mod¬ 
eled  by  mapping  a  texture  to  a  curved  surface,  or 
deforming a planar texture [7].
• Surface-structure texture. This class encompasses
surface  roughness  on  snow  and  dirt,  bricks,  etc.  The 
surface  and  the  interior  are  composed  of  the  same 
material. 

•  Surface-covering  texture.  This  class  includes  grass, 
carpet,  etc.,  where  the  surface  and  underlying  mate¬ 
rial  differ. 

The  image  textures  formed  by  projection  of  these  classes 
of  scene  textures  are  very  diverse  and  very  complex.  Be¬ 
cause  of  this  complexity,  we  cannot  model  them  in  detail; 
fortunately,  for  segmentation  purposes  there  is  no  need  to 
model  them  in  detail.  We  need  only  measure  image  features 
that  allow  us  to  discriminate  reliably  between  physically  dif¬ 
ferent  textures. 

3.2  The  model 

We  model  scene  textures  by  scene  texture  processes,  which 
fill  a  region  of  space  with  a  null  texture  and  a  possibly  empty 
collection  of  substructures,  each  of  which  is  again  described 
by  a  scene  texture.  The  null  texture  is  the  simplest  scene 
texture;  instances  of  null  textures  are  everywhere  identical 
at  the  finest  resolution.  An  instance  of  a  scene  texture  is 
the  three-dimensional  region  filled  by  the  process. 

The  substructures  in  a  non-null  texture  are  chosen  from 
some  statistical  distribution,  and  placed  (in  orientation  and 
position)  according  to  another  distribution.  We  character¬ 
ize  a  substructure  by  its  shape,  its  position  (including  ori¬ 
entation),  and  its  subtexture  (including  color).  A  non-null 
texture  process  is  specified  by  distributions  of  shape,  posi¬ 
tion,  and  subtexture  of  its  substructures,  and  by  the  color 
of  its  null  texture  background.  For  example,  a  sand  texture 
is  composed  of  grains  of  sand;  each  grain  is  chosen  from  a 
distribution  that  specifies  its  size,  shape  and  color,  and  it  is 
positioned  and  oriented  according  to  another  distribution. 
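
A scene texture process of this kind can be sketched as a small generative model in Python. The sketch is deliberately simplified and rests on several assumptions that are not in the text: shape is reduced to a single size, the subtexture of each substructure is reduced to a single color (so the recursion into subtextures is omitted), sizes and colors are Gaussian, and positions and orientations are uniform. All class and field names are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Substructure:
    size: float
    color: float
    position: tuple      # (x, y, z) in scene coordinates
    orientation: float   # degrees

@dataclass
class TextureProcess:
    background_color: float  # color of the null-texture background
    density: float           # expected substructures per unit volume
    mean_size: float
    size_spread: float
    mean_color: float
    color_spread: float

    def instance(self, extent, rng):
        """Fill a cube of side `extent` with randomly drawn substructures.

        A process with zero density yields an empty list, which corresponds
        to an instance of a null texture.
        """
        count = round(self.density * extent ** 3)
        return [
            Substructure(
                size=abs(rng.gauss(self.mean_size, self.size_spread)),
                color=rng.gauss(self.mean_color, self.color_spread),
                position=tuple(rng.uniform(0.0, extent) for _ in range(3)),
                orientation=rng.uniform(0.0, 180.0),
            )
            for _ in range(count)
        ]
```

Two calls to `instance` with different random generators produce different instances of the same process, which is the distinction between a texture process and its instances drawn in Section 1.1.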

We  model  texture  arising  from  object  surfaces  in  a  simi¬ 
lar  manner.  Surface-structure  texture  and  surface-covering 
texture  are  modeled  as  smooth  surfaces  overlaid  with  sub¬ 
structures  to  provide  the  surface  texture.  Such  substruc¬ 
tures  are  again  chosen  from  statistical  distributions  that 
specify their shape, subtexture, and position and orientation
on the surface. Similarly, surface-marking textures
may  be  modeled  as  smooth  surfaces  covered  with  two- 
dimensional  substructures  chosen  from  distributions  speci¬ 
fying  their  attributes  and  positions. 

4  Image  Features 

For  texture  discrimination,  we  wish  to  determine  a  set  of 
image  texture  measures  that  is  sufficient  to  discriminate  a 
large  number  of  physically  different  textures.  To  do  this,  we 
examine  the  appearance  in  the  image  of  scene  textures  con¬ 
forming  to  the  model  proposed  above,  and  derive  texture 
features  that  serve  to  distinguish  textures  with  different  pa¬ 
rameters.  This  paper  provides  only  a  brief  summary  of  the 
derivation  of  the  texture  measures;  more  details  may  be 
found  in  [28]. 

To  discriminate  null  textures,  measurement  of  image 
color  generally  suffices;  differences  in  image  color  usually 


imply  differences  in  object  color  (assuming  common  illu¬ 
mination).  Regions  differing  in  color  spectra  differ  as  well 
in  intensity,  except  on  a  set  of  measure  zero,  and  hence 
intensity  cues  suffice  to  detect  almost  all  color  edges  (see 
[12]).  We  detect  edges  in  null  textures  where  image  in¬ 
tensity  changes  abruptly;  this  corresponds  to  the  classical 
notion  of  edge  detection. 

To  discriminate  non-null  textures,  we  wish  to  compare 
distributions  of  shape,  position,  and  color  of  image  sub¬ 
structures.  We  base  comparison  of  substructure  shape  on 
comparison  of  overall  length  and  width  of  image  substruc¬ 
tures,  and  on  length  of  image  edges.  We  also  estimate  dis¬ 
tributions  of  overall  orientation  of  image  substructures  and 
image  edges.  The  substructures  composing  image  texture 
are  generally  difficult  to  isolate  (e.g.,  in  grass  or  carpet),  so 
it  is  difficult  to  measure  their  attributes  directly.  Instead, 
we  use  feature  detectors  that  are  tuned  to  image  structures 
with  particular  attributes.  For  example,  we  use  an  elon¬ 
gated  center-surround  detector  to  detect  image  regions  of 
certain  length  and  width  that  differ  in  mean  intensity  from 
the  local  surround.  Summing  the  responses  of  these  detec¬ 
tors  in  an  appropriate  manner  provides  an  estimated  his¬ 
togram  for  a  single  feature,  as  follows. 

Suppose  that  image  structures  could  be  isolated  and  we 
could  measure  their  attributes  (e.g.,  length,  orientation)  di¬ 
rectly.  We  could  estimate  histograms  of  these  attributes 
by  dividing  the  domain  of  each  attribute  into  a  number  of 
ranges  (bins),  and  counting  the  number  of  structures  whose 
attribute  value  falls  into  each  range.  We  may  compute  ap¬ 
proximations  to  such  histograms  by  applying  feature  detec¬ 
tors  to  the  image.  A  feature  detector  is  tuned  to  image 
structures  with  a  particular  set  of  attributes,  such  as  size, 
elongation,  orientation  and  color.  Each  feature  detector 
computes  a  measure  of  the  confidence  in  the  presence  of  a 
structure  at  the  detector’s  location,  with  value  1  indicating 
high  confidence  that  there  is  a  structure  whose  attributes 
match  those  to  which  the  detector  is  tuned,  and  0  indicat¬ 
ing  low  confidence.  Suppose  again  that  image  structures 
are  well  defined,  and  suppose  further  that  feature  detectors 
respond  with  1  only  when  situated  on  an  image  structure 
whose  attributes  match  those  to  which  the  detector  is  tuned, 
and  0  otherwise.  Then  we  may  construct  a  histogram  for, 
say,  orientation,  by  evaluating  feature  detectors  at  densely 
sampled  locations  within  a  region,  and  summing  responses 
for  each  orientation  bin  over  all  sizes,  elongations  and  col¬ 
ors.  The  resulting  histogram  will  be  identical  to  the  one 
computed  by  isolating  the  structures  and  measuring  their 
features.  The  advantage  of  this  procedure  is  that  it  can  be 
applied  to  images  in  which  structures  are  not  well  defined. 
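
The equivalence claimed for the idealized case can be checked with a small Python sketch. It assumes, for illustration only, that each structure occupies a single pixel location and that its size and orientation have already been assigned to bin indices. The dictionary keys and function names are hypothetical.

```python
from collections import Counter

def direct_histogram(structures, n_orient):
    """Count isolated structures per orientation bin."""
    counts = Counter(s["orient"] for s in structures)
    return [counts.get(k, 0) for k in range(n_orient)]

def ideal_detector(by_loc, size, orient, x, y):
    """Respond with 1 only on a structure whose attributes match the tuning."""
    s = by_loc.get((x, y))
    if s is not None and s["size"] == size and s["orient"] == orient:
        return 1
    return 0

def detector_histogram(structures, n_size, n_orient, width, height):
    """Sum detector responses for each orientation bin over sizes and locations."""
    by_loc = {(s["x"], s["y"]): s for s in structures}
    hist = [0] * n_orient
    for k in range(n_orient):
        for i in range(n_size):
            for x in range(width):
                for y in range(height):
                    hist[k] += ideal_detector(by_loc, i, k, x, y)
    return hist
```

With ideal detectors the two histograms agree exactly. With real detectors, whose responses are graded confidences rather than 0 or 1, the summed responses only approximate the direct counts, but they remain computable when structures cannot be isolated.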

To  estimate  distributions  of  overall  structure  shape 
and  orientation,  we  employ  elongated  structure  detectors 
(ESD’s)  that  test  for  image  structures  by  comparing  aver¬ 
age  color  inside  and  outside  an  elongated  region.  Fig.  2 
shows  such  a  detector;  its  parameters  are  given  by  length 
L,  elongation  E  =  wc/ L,  orientation  6,  and  a  color  sensitiv¬ 
ity  function  C( A)  that  specifies  the  spectral  response  of  the 
detector  (cf.  [13]).  Image  substructures  may  be  elongated 
or  not;  they  appear  in  all  sizes  and  elongations.  Since  the 
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Figure  2:  An  elongated  structure  detector.  It  computes 
the  significance  of  the  difference  between  pixel  means  in 
center  and  surround. 


size  of  image  substructures  is  unknown  in  advance,  we  must 
employ  a  variety  of  lengths  L  of  detectors.  Corresponding 
to each length are a variety of elongations $E$ ranging from 1
(circular or isotropic) to $E < 1$ (narrow). For each of these
length/elongation pairs we employ a variety of orientations
$\theta$. For each detector in this three-dimensional collection we
employ  a  set  of  n  spectral  sensitivity  functions;  in  typical 
situations,  n  =  3  for  color  images  (e.g.,  red,  green,  blue) 
and  n  =  1  for  black  and  white  images.  For  simplicity  in 
what  follows,  we  shall  restrict  attention  to  black  and  white 
images. 

Each ESD computes the statistical significance of the difference
$D$ between the means of pixel gray levels in its center
region and in its surround region, using a simple $t$-test
[24]. The operator computes $1 - P(D \mid H_0)$, where $H_0$
is the null hypothesis that the distributions are identical.
This response approaches 1 as the structure becomes more
significant. Let us denote the response of an operator of
length $L$, elongation $E$, orientation $\theta$, at image location
$(x, y)$, by $R(L, E, \theta, x, y)$. We wish to estimate histograms
of $L$, $E$ and $\theta$ of structures. Suppose we divide the $L$ domain
into $M_L$ bins, $L_1, L_2, \ldots, L_{M_L}$; similarly, we divide $E$ into
$E_1, E_2, \ldots, E_{M_E}$ and $\theta$ into $\theta_1, \theta_2, \ldots, \theta_{M_\theta}$.
The histogram for $L$ is a vector with $M_L$ components,
$(l_1, l_2, \ldots, l_{M_L})$. Each component $l_i$ is the sum, over the
image region under consideration, of $R(L_i, E_j, \theta_k, x, y)$,
where $1 \le j \le M_E$ and $1 \le k \le M_\theta$, and $x$ and $y$ range
over the region. We compute the histograms for $E$ and $\theta$ in an
analogous manner. Figure 3 illustrates this feature-value
summation for the $\theta$ feature.
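
As a rough illustration of the summation just described, the following Python sketch accumulates an orientation histogram by summing detector responses over all lengths, elongations and sample locations. The names `orientation_histogram` and `toy_response` are hypothetical, and `toy_response` is only a stand-in for the response $R$; it does not implement the t-test operator.

```python
def orientation_histogram(response, lengths, elongations, thetas, region):
    """Return one summed response per orientation bin.

    response(L, E, theta, x, y) is assumed to return a confidence in [0, 1];
    region is an iterable of (x, y) sample locations.
    """
    points = list(region)
    hist = []
    for theta in thetas:
        total = 0.0
        # Sum over every length, elongation and location for this orientation.
        for L in lengths:
            for E in elongations:
                for (x, y) in points:
                    total += response(L, E, theta, x, y)
        hist.append(total)
    return hist


def toy_response(L, E, theta, x, y):
    """Illustrative stand-in: fires only for 90-degree detectors at even x."""
    return 1.0 if theta == 90 and x % 2 == 0 else 0.0


region = [(x, y) for x in range(4) for y in range(4)]
hist = orientation_histogram(toy_response, [8, 12], [0.7, 0.3],
                             [0, 45, 90, 135], region)
```

The histograms for length and elongation would follow the same pattern with the roles of the loop variables exchanged.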

Figure 3: Illustration of feature summation for edges in
ESD orientation. Responses of ESDs of various lengths and
elongations are summed in each slice at each orientation to
form a histogram vector (here, with four components).

To determine the significance of any difference between
two histograms, we use the edge detection method proposed
in [27]. Briefly, an edge detector such as the one pictured in
Fig. 4 is used. It divides its receptive field on either side of
a hypothesized edge into $N$ ‘slices’ and computes an averaged
histogram vector for each slice. Multivariate regression
techniques are used to fit lines to the data on either side of
the edge, and to compute the intercept vectors at the center
along with their associated covariance matrices. Edges
are manifested as differences in the intercepts; the operator
calculates the statistical significance of any difference. The
image of an object with a curved surface exhibits smoothly
changing texture features; this edge detector is insensitive
to such smooth variation. (This type of edge detector has
its roots in the statistical techniques for intensity edge
detection proposed by Haralick [10,9] and Leclerc [21].)
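
The intercept comparison can be sketched in a simplified scalar form. The code below is an assumption-laden illustration, not the multivariate method of [27]: it fits an ordinary least-squares line to one feature value per slice on each side of the hypothesized edge and returns the difference of the two intercepts at the edge. It omits the covariance matrices and the significance computation, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns (intercept at x=0, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def intercept_difference(left, right):
    """Difference of fitted intercepts at the hypothesized edge (x = 0).

    left and right hold one scalar value per slice, ordered by position;
    slice centers are placed at half-integer offsets from the edge.
    """
    xs_left = [-(len(left) - i) + 0.5 for i in range(len(left))]
    xs_right = [i + 0.5 for i in range(len(right))]
    a_left, _ = fit_line(xs_left, left)
    a_right, _ = fit_line(xs_right, right)
    return a_right - a_left


# A smooth ramp (y = 2x + 1) sampled on both sides: no intercept jump.
ramp = intercept_difference([-6, -4, -2, 0], [2, 4, 6, 8])
# A step of height 3 at the edge: the intercepts differ by 3.
step = intercept_difference([1, 1, 1, 1], [4, 4, 4, 4])
```

The ramp case shows why fitting lines rather than averaging each side makes the detector insensitive to smooth variation: the trend is absorbed by the slope and leaves the intercepts equal.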

To  estimate  distributions  of  length  and  orientation  of  the 
edges  that  form  substructure  boundaries,  we  use  intensity 
edge  operators  of  various  lengths  and  widths.  These  op¬ 
erators  are  like  the  ones  just  described  (see  Fig.  4)  except 
that  they  average  image  intensity  over  each  slice.  To  ob¬ 
tain  orientation  and  length  histograms,  we  employ  inten¬ 
sity  edge  operators  of  several  lengths  and  orientations,  at 
densely  sampled  positions  in  the  image.  The  responses  are 
summed  across  length  and  orientation,  as  for  the  ESDs,  to 
calculate  the  histograms.  We  also  use  this  type  of  intensity 
edge  operator  to  compute  the  significance  of  any  overall 
difference  in  intensity  or  color  between  two  texture  regions. 
The  overall  color  difference  may  be  used  for  discrimination 
when  there  are  no  differences  in  other  texture  features;  this 
allows  segmentation  using  intensity,  as  in  classical  edge  de¬ 
tection. 

4.1  Feature  combination 

We  must  combine  the  significances  of  differences  between 
individual  distributions  to  obtain  the  overall  significance 
of  the  texture  difference  between  two  regions.  It  simplifies 
matters  greatly  if  we  assume  that  the  distributions  are  inde¬ 
pendent,  i.e.,  that  the  attributes  (such  as  shape  and  color) 
of  a  substructure  are  chosen  independently.  In  addition, 
we  assume  that  the  length  and  orientation  of  substructure 


Figure  4:  An  edge  detector.  It  divides  its  receptive  field 
into $N$ slices (here, $N = 4$) on either side of a hypothesized
edge,  as  shown. 

edges  are  independent  from  the  overall  size  and  orientation 
of  substructures.  While  the  latter  assumption  is  clearly  an 
oversimplification,  it  is  a  reasonable  approximation,  partic¬ 
ularly  since  the  small-scale  edges  bounding  a  structure  are 
often  created  by  processes  independent  from  those  which 
create  the  structure  as  a  whole.  Similarly,  we  may  treat  any 
overall  difference  in  intensity  or  color  between  two  regions 
as  independent  from  differences  in  distributions  of  substruc¬ 
ture  features,  although  this  is  again  an  oversimplification. 

Under  the  assumption  of  independence,  the  probability 
of  observing  all  feature  differences  between  two  regions  si¬ 
multaneously  is  the  product  of  the  individual  probabilities, 
so  the  overall  significance  of  the  texture  difference  is  just 
the product of the individual significances. When feature
distributions  are  in  fact  correlated,  the  computed  overall 
significance  overstates  the  true  significance  of  the  edge.  It  is 
worth  noting  that  human  vision  appears  to  assume  indepen¬ 
dence  of  features  as  well,  since  we  cannot  perceive  texture 
differences  defined  by  a  conjunction  of  features  [25,26,14]. 
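
A minimal sketch of this combination rule follows, assuming each individual significance is a probability and that natural logarithms are used (the source does not state the base). Multiplying independent probabilities is the same as adding their negative logarithms, which is the form in which the edge significances are tabulated in Section 5. The function name is hypothetical.

```python
import math


def combined_neg_log_significance(probabilities):
    """Sum of negative logs, equal to the negative log of the product."""
    return sum(-math.log(p) for p in probabilities)


# Two independent features with probabilities 0.1 and 0.5 combine to 0.05.
total = combined_neg_log_significance([0.1, 0.5])
product = math.exp(-total)

# The six individual values in the "Right" row of the edge-significance
# table in Section 5 add up to that row's overall value (5.959) to within
# rounding.
right_row = [1.635, 0.758, 1.375, 1.372, 0.578, 0.242]
right_overall = sum(right_row)
```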

5  Empirical  Results 

The  acid  test  of  a  generally  useful  set  of  texture  measures 
is  its  ability  to  distinguish  many  different  textures.  Proof 
of  such  an  ability  rests  mainly  upon  empirical  tests,  which 
are  the  subject  of  this  section.  A  forced-choice  test  such 
as  that  discussed  in  Section  2  was  conducted  on  a  number 
of texture pairs constructed from images taken from Brodatz’s
[3] album. A 4-alternative test was used; the possible
edge locations were vertical at left and right, horizontal at
top  and  bottom,  placed  symmetrically.  (For  simplicity,  the 
edge  was  always  placed  at  the  bottom,  though  this  was  not 
known  to  the  edge  detection  program.)  Figures  5  and  6 
show  two  of  the  texture  pairs  tested.  Although  an  accurate 
test would use many texture trials to determine $P_m(T_i, T_j)$
for  a  texture  pair,  these  tests  used  only  one  trial;  these  re¬ 
sults  should  therefore  be  viewed  as  an  approximation.  In 
these  tests,  no  preprocessing  was  performed  on  the  image 


to  equalize  histograms  or  overall  intensity. 

This  section  reports  the  results  of  three  series  of  exper¬ 
iments.  These  experiments  determined  the  discrimination 
ability  of  three  different  sets  of  texture  measures  on  a  set 
of  45  texture  pairs.  The  first  and  second  series  used  the 
same  set  of  texture  pairs,  while  the  third  used  a  different 
set  of  textures.  We  discuss  the  first  series  in  detail,  then 
summarize  the  results  of  the  second  and  third  series. 

In  all  experiments,  distributions  of  five  features  were  esti¬ 
mated:  size,  elongation  and  orientation  of  elongated  struc¬ 
tures,  along  with  length  and  orientation  of  intensity  edges; 
averaged intensity was used as the sixth feature. The images
were 256 x 256 pixels. In the first series, the ESD's
were of two sizes $S$ (8 and 12 pixels); they each had two
elongations $E$ (0.7 and 0.30); and they had four orientations
$\theta$ (0, 45, 90 and 135 degrees). They used equal center
and surround areas, so $w_c = w_s$ (see Fig. 2). Elongation of
1.0 was treated as an isotropic operator; these were circular
center-surround operators with no orientation and diameter
equal to $S$. There were thus 2 x 2 x 4 = 16 different ESD’s
and two sizes of isotropic operator. There were two lengths
$L$ of intensity edge detector, again 8 and 12 pixels, and the
same four orientations $\phi$. The width of the small edge
detectors was a constant fraction of the length, $W = 0.6L$,
and they used $N = 4$ slices.
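
The first-series operator set can be enumerated directly from the parameters listed above. This short sketch only counts parameter combinations; the tuple layouts and variable names are illustrative assumptions, not part of the original implementation.

```python
from itertools import product

sizes = [8, 12]                  # S, in pixels
elongations = [0.7, 0.3]         # E
orientations = [0, 45, 90, 135]  # degrees

# Oriented ESDs: every (size, elongation, orientation) combination.
esds = list(product(sizes, elongations, orientations))

# Isotropic operators: elongation 1.0, no orientation, one per size.
isotropic = [(s, 1.0, None) for s in sizes]

# Small intensity edge detectors: every (length, orientation) combination.
edge_detectors = list(product(sizes, orientations))
```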

To  test  for  differences  in  individual  distributions,  the  fea¬ 
ture  edge  detectors  discussed  in  Section  4  (see  Fig.  4)  were 
used. Each of these edge detectors had length $L = 80$ pixels,
width $W = 50$, and used $N = 5$ slices, except in the case
of  orientation  which  used  N  =  6.  The  average-intensity 
edge  detector  had  the  same  dimensions  and  N  =  4.  For 
ease  of  interpretation,  the  operators  in  the  implementation 
computed  the  negative  logarithm  of  the  significance;  this 
value  grows  with  the  edge  significance. 

Discrimination tests were carried out for the 45 texture
pairs constructed from 10 texture images taken from Brodatz.
Figures 5 and 6 show two of the images tested; Tables
1 and 2 contain the negative log significance of the
texture edges at four locations. The six features are: size
$S$, orientation $\theta$ and elongation $E$ of elongated structures,
length $L$ and angle $\phi$ of intensity edges, and averaged
intensity $I$. The overall edge significance is the sum of the
individual significances. Table 3 summarizes all the results.
Each  entry  in  this  table  contains  two  symbols:  the  first 
indicates  results  using  all  features,  including  the  overall  in¬ 
tensity  difference,  the  second  indicates  results  excluding  in¬ 
tensity.  In  this  series  of  tests,  the  textures  were  correctly 
discriminated  in  41  images,  or  91.1%.  When  the  overall  in¬ 
tensity  difference  was  not  used  for  discrimination,  accuracy 
was  still  high,  38  of  45,  or  84.4%. 

Table 1: Edge significances for D5-D84 (mica/raffia, Fig. 5) image.

Columns: $S$, $\theta$, $E$, $L$, $\phi$, $I$, Overall

Top: [illegible], [illegible], 2.418, 5.968
Bottom: 1.670, [illegible], 1.314, 4.907, 11.117
Left: 0.465, 0.377, 0.216, 0.341, 0.211, 1.785
Right: 1.635, 0.758, 1.375, 1.372, 0.578, 0.242, 5.959

Table 2: Edge significances for D23-D94 (pebbles/bricks, Fig. 6) image.

Figure 5: A texture edge constructed from Brodatz’s images
D5 (expanded mica) and D84 (raffia).

Figure 6: A texture edge formed from images D23 (pebbles)
and D94 (bricks).

It is of interest to see how discrimination ability changes
when a different variety of texture measures is used. A second
series of experiments used a larger variety of lengths
of ESD and edge detector and slightly different elongations
for the ESDs; the parameters of the feature detectors are
shown in Table 4. A total of 39 operators were used; the
texture images were the same as in Series I. The results of
this experiment are summarized in Table 5. Of the 45 texture
pairs, 38 (84.4%) were correctly discriminated using all
features including intensity, while 41 (91.1%) were correctly
discriminated when intensity was ignored.

The  third  series  of  experiments  used  a  larger  variety  of 
orientations  of  ESD  and  of  edge  detector,  on  a  different  set 
of  45  texture  pairs,  again  from  Brodatz;  Table  4  contains 
the  details.  Table  6  summarizes  the  results  of  this  experi¬ 
ment:  45  (100%)  were  correctly  discriminated  using  all  fea¬ 
tures  including  intensity,  while  44  (97.7%)  were  correctly 
discriminated  when  intensity  differences  were  ignored. 

These  empirical  results  indicate  that  these  texture  mea¬ 
sures  are  able  to  reliably  distinguish  natural  textures,  for 
a  variety  of  choices  of  the  measures’  parameters.  However, 
one  should  not  regard  these  measures  as  suitable  for  dis¬ 
tinguishing  all  textures.  The  textures  used  in  this  series  of 
experiments  were  all  quite  different.  The  current  parame¬ 
ter  values  would  probably  not  work  well  for  discriminating 
textures  that  are  more  similar  to  one  another,  particularly 
fine-grain  textures  such  as  sand.  For  this  purpose,  a  greater 
range  of  size  and  elongation  resolution  would  be  required. 
In  addition,  more  than  four  or  five  orientations  might  be  re¬ 
quired  to  discriminate  oriented  textures  differing  by  a  small 
angle. 

5.1  Comparison  with  other  measures 

The  reader  may  object  that  the  texture  measures  proposed 
here  are  complex  and  time-consuming  to  compute,  and 
that  simpler,  more  easily  computable  measures  might  per¬ 
form  just  as  well.  For  example,  the  texture  measures  pro¬ 
posed  by  Laws  [20]  are  well  known,  efficiently  computable, 
and  have  been  shown  to  be  effective  in  classifying  textures 
[20,22].  For  comparison  purposes,  Laws’  texture  features 
were  tested  on  the  same  set  of  45  texture  pairs  used  in 
the  first  two  series  of  discrimination  experiments.  Laws’ 
method  computes  the  “texture  energy”  of  several  texture 
features  as  follows.  The  image  is  convolved  with  each  of 
several small kernels $K_i$, yielding feature images $F_i$. Each
$F_i$ is then convolved with a 15 x 15 pixel “texture energy”
filter that computes an approximation to the variance in
a window by summing the absolute values of the feature
values. This operation produces several “texture energy”
images $T_i$, which collectively form a vector-valued texture
energy  image.  These  vector  values  were  used  as  input  to 
the  texture  edge  detector  described  above  and  in  [27]  to 
compute  the  significance  of  an  edge. 
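
The two-stage computation can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the 3-element vectors `L3`, `E3` and `S3` are the commonly cited Laws vectors (see [20] for the authoritative definitions), the filtering is done as an unpadded correlation, and the window size is a parameter so that a small test image can be used in place of the 15 x 15 window. All function names are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors, giving a 2-D mask."""
    return [[a * b for b in v] for a in u]


def correlate_valid(image, kernel):
    """Slide the kernel over the image without padding (valid region only)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(w - kw + 1)]
            for r in range(h - kh + 1)]


def texture_energy(feature, window):
    """Sum the absolute feature values over each window x window block."""
    abs_feature = [[abs(v) for v in row] for row in feature]
    ones = [[1] * window for _ in range(window)]
    return correlate_valid(abs_feature, ones)


# Commonly cited Laws vectors (assumed here): level, edge, spot.
L3, E3, S3 = [1, 2, 1], [-1, 0, 1], [-1, 2, -1]

# A 7 x 7 image of one-pixel-wide vertical stripes.
stripes = [[c % 2 for c in range(7)] for _ in range(7)]

feature = correlate_valid(stripes, outer(L3, S3))   # 5 x 5 feature image
energy = texture_energy(feature, window=3)          # 3 x 3 energy image
```

Repeating this for each mask and stacking the resulting energy values per pixel gives the vector-valued image that is passed to the edge detector.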

The first series of tests used Laws’ nine 3x3 masks. An
edge operator with $N = 11$ slices was used to test for an
edge. (The reader is referred to [20] for the details of these
masks.) Table 7 summarizes the results: the texture edge
was correctly located in 17 of the 45 texture images, or
37.8%; chance performance in this task is 25%.

This  low  level  of  performance  may  be  due  to  the  small 
size  of  the  masks,  which  capture  only  fine-grain  texture  in¬ 
formation;  larger  masks  should  be  better  able  to  capture 
larger-scale  texture  information,  allowing  better  texture  dis¬ 
crimination.  Laws  also  uses  a  set  of  sixteen  5x5  kernels 
for  texture  classification.  Four  of  these  5x5  masks  (L5E5, 
L5S5,  E5S5,  and  R5R5)  have  been  claimed  to  be  the  most 


effective  in  classification  tasks  [22].  A  second  series  of  tests 
employed  these  4  masks  and  their  90  degree  rotations  (a 
total  of  seven  masks,  since  R5R5  is  unchanged  under  ro¬ 
tation).  The  results,  using  an  edge  operator  with  N  =  9 
slices, are also shown in Table 7; 19 of the 45 images were
correctly  discriminated,  or  42.2%.  The  larger  masks  allow 
for  slightly  better  discrimination,  but  performance  is  still 
unsatisfactory. 
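
The count of seven masks can be checked with a short sketch. It assumes the commonly cited Laws 5-element vectors (see [20]) and represents a 90 degree rotation by the transpose of the mask; for these symmetric or antisymmetric vectors the two agree up to sign, which does not affect an energy measure based on absolute values. The variable names are illustrative.

```python
def outer(u, v):
    """Outer product of two vectors, giving a 2-D mask."""
    return [[a * b for b in v] for a in u]


def transpose(mask):
    """Swap rows and columns of a 2-D mask."""
    return [list(col) for col in zip(*mask)]


# Commonly cited Laws vectors (assumed here): level, edge, spot, ripple.
L5 = [1, 4, 6, 4, 1]
E5 = [-1, -2, 0, 2, 1]
S5 = [-1, 0, 2, 0, -1]
R5 = [1, -4, 6, -4, 1]

base = [outer(L5, E5), outer(L5, S5), outer(E5, S5), outer(R5, R5)]

masks = []
for mask in base:
    for candidate in (mask, transpose(mask)):
        # Keep a rotated mask only if it is not already in the set.
        if candidate not in masks:
            masks.append(candidate)
```

Only R5R5 coincides with its own transpose, so the four base masks yield seven distinct masks.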

These  rather  poor  results  may  be  attributed  to  a  number 
of  causes.  First,  the  small  kernels  are  sensitive  to  only  fine- 
scale  texture  features,  while  the  texture  that  distinguishes 
the  image  regions  is  characterized  by  larger-scale  structures. 
If  one  must  distinguish  textures  characterized  by  large-scale 
features,  the  efficiency  of  small  masks  must  be  sacrificed; 
there  is  no  alternative  but  to  measure  large-scale  features. 
Second,  the  15  x  15  operator  that  measures  the  “texture 
energy”  blurs  the  texture  feature  values  across  the  edge. 
This  effect  causes  the  texture  energy  to  change  gradually 
across  the  edge,  giving  the  intercepts  of  the  fitted  regression 
lines  a  high  variance;  this  in  turn  causes  the  differences  in 
intercepts  to  have  lower  significance.  The  feature  detectors 
proposed  in  this  paper  are  quite  sensitive  to  position  and 
are  much  less  prone  to  the  blurring  effect,  so  they  do  not 
suffer  from  this  loss  of  significance.  An  edge  operator  that 
assumed  uniform  texture  fields  on  either  side  of  the  edge 
would  average  out  the  blurred  feature  values,  leading  to 
more effective discrimination$^3$, but such edge operators are
not  as  useful  in  practice  since  texture  images  are  rarely 
uniform. 

6  Conclusion 

In  this  paper,  we  have  viewed  texture  as  not  merely  pat¬ 
tern  in  the  image,  but  as  the  projection  of  structure  in  the 
scene.  We  have  broken  the  problem  of  texture  segmentation 
into  two  phases,  texture  discrimination  and  localization  of 
edges  thus  detected;  this  paper  has  focused  on  the  former 
topic.  To  determine  when  two  image  textures  differ,  we 
must  look  beyond  the  image  and  infer  parameters  of  the 
processes  by  which  the  textures  were  created,  and  try  to 
determine  whether  those  parameters  differ.  To  this  end, 
we  must  model  texture  in  the  scene,  model  its  projection  to 
the  image,  and  determine  a  set  of  image  features  that  allows 
us  to  discriminate  reliably  between  physically  different  tex¬ 
tures  in  the  modeled  class.  The  proposed  texture  model  is 
general  enough  to  model  a  large  class  of  textured  objects;  it 
is  based  on  distributions  of  shape,  position  and  subtexture 
of  texture  substructures.  Differences  in  simple  image  fea¬ 
tures  derived  from  these  distributions  allow  segmentation 
of  physically  different  textures.  The  features  used  are  size, 
elongation  and  orientation  of  elongated  structures,  length 
and  orientation  of  image  edges,  and  overall  color  differences 
between image regions. We use statistical tests to determine
the significance of differences between estimated
distributions of these image features. This method provides
not only an indication of when textures differ, but also
predicts how likely we are to be wrong if we infer the existence
of a true discontinuity.

3 In fact, when such a test was carried out on the same set
of texture pairs, using a multivariate $t$-test such as described in
[27], both the nine 3x3 masks and the proposed texture
measures discriminated the textures with 100% accuracy.

Table 7: Results using Laws’ texture measures. First circle in each entry is for 3x3 operators, second is for 5x5 operators.
Solid circle, correct discrimination; hollow, incorrect.

It  must  be  emphasized  again  that  no  single  set  of  im¬ 
age  texture  measures  will  suffice  to  discriminate  between 
all  physically  different  textures.  Corresponding  to  any  set 
of  texture  measures  there  are  classes  of  textures  that  the 
measures  cannot  discriminate.  The  texture  measures  pro¬ 
posed  here  are  capable  of  discriminating  a  large  number  of 
textures,  but  there  are  certainly  many  more  that  are  indis- 
criminable.  For  example,  human  vision  seems  to  discrim¬ 
inate  some  textures  on  the  basis  of  terminations  of  elon¬ 
gated  structures  [14];  it  might  be  desirable  for  a  machine 
vision  system  to  identify  similar  features.  One  must  be 
very  clear  about  the  goals  of  a  texture  segmentation  al¬ 
gorithm.  If  it  is  to  discriminate  just  those  textures  that 
humans  discriminate,  then  careful  attention  must  be  paid 
to  psychophysical  data  that  help  to  define  the  texture  fea¬ 
tures  employed  by  our  visual  system  (e.g.,  [26,14,15]).  If  it 
is  known  that  two  classes  of  textures  in  the  world  must  be 
discriminated,  then  image  measures  can  be  constructed  for 
that  purpose.  In  particular,  if  image  structures  are  known 
to  be  well  defined,  then  distributions  of  their  attributes  can 
be  measured  directly  (e.g.,  [30]),  and  discrimination  based 
on  these  directly-measured  distributions  will  be  superior  to 
that  based  on  more  general  measures  such  as  those  pro¬ 
posed  here.  Very  fine  grain  textures,  such  as  sand,  may 
require  texture  measures  such  as  intensity  variance  for  dis¬ 
crimination.  In  the  absence  of  special  information  about  the 
domain,  it  is  reasonable  to  measure  image  features  similar 
to  those  measured  by  humans;  this  allows  a  machine  vision 
system  to  detect  those  edges  detected  by  humans,  and  pre¬ 
vents  it  from  hallucinating  edges  not  perceived  by  humans. 
The  texture  measures  proposed  here  are  similar  to  Julesz’s 
‘textons’  [14,15],  and  a  texture  segmentation  program  that 
uses  these  measures  should  have  texture  discrimination  abil¬ 
ity  similar  to  that  of  humans. 

Several problems await further research. The other aspect
of the image segmentation problem, namely edge localization,
is the subject of another report, and we shall not


delve  further  into  that  subject  here  (but  see  [29]).  The  best 
values  for  the  parameters  of  the  feature  detectors  are  cur¬ 
rently  under  investigation.  It  might  be  useful  to  use  some 
tuning  method  that  makes  slight  changes  to  the  parame¬ 
ters  to  obtain  optimal  performance  over  a  set  of  example 
texture  images.  Further  empirical  tests  must  be  performed 
in  order  to  determine  the  ability  of  these  texture  measures 
to  distinguish  textures.  However,  the  empirical  results  pre¬ 
sented  above  are  encouraging  evidence  for  the  effectiveness 
of  these  texture  measures. 
Abstract

Standard  edge-detectors  fail  to  find  most  relevant  edges,  find¬ 
ing  either  too  many  or  too  few,  because  they  lack  a  geometrical 
model  to  guide  their  search.  We  present  a  technique  that  inte¬ 
grates  both  photometric  and  geometric  models  with  an  initial 
estimate  of  the  boundary.  The  strength  of  this  approach  lies  in 
the  geometric  model’s  ability  to  overcome  various  photometric 
anomalies,  thereby  finding  boundaries  that  could  not  otherwise 
be  found.  Furthermore,  the  model  can  be  used  to  score  edges 
based  on  their  goodness  of  fit,  thus  allowing  one  to  use  the  se¬ 
mantic  model  information  to  accept  or  reject  them. 

1  Introduction 

In  real-world  images,  object  boundaries  cannot  be  detected 
solely  on  the  basis  of  their  photometry  because  of  the  presence 
of  noise  and  various  photometric  anomalies.  Thus,  all  methods 
for  finding  boundaries  based  on  purely  local  statistical  criteria 
are  bound  to  make  mistakes,  finding  either  too  many  or  too  few 
edges, usually based on arbitrary thresholds.

To  supplement  the  weak  and  noisy  local  information,  we  con¬ 
sider  the  geometrical  constraints  that  object  models  can  provide. 
We  do  this  by  describing  a  boundary  as  an  elastic  curve  with  a 
deformation  energy  derived  from  the  geometrical  constraints,  as 
suggested by Terzopoulos and Kass et al. [4,7]. Local minima
in this energy correspond to boundaries that exactly match the
object  model.  We  incorporate  the  photometric  constraints  by 
defining  a  photometric  energy  as  the  average  along  the  curve 
of  a  scalar  energy  field  derived  from  the  photometric  model  of 
an  edge.  Local  minima  in  this  energy  correspond  to  boundaries 
that  exactly  match  the  photometric  model.  We  find  a  candidate 
boundary  by  deforming  the  curve  in  such  a  way  as  to  minimize 
its  total  energy,  which  is  the  sum  of  the  deformation  and  photo¬ 
metric  energies.  We  call  this  deformation  process  "optimizing" 
the curve. Once a curve has been optimized, i.e., once it has
settled in a local minimum of the total energy (which is, in
effect, a compromise between the two constraints), we can then
use  the  object  models  more  fully  to  determine  whether  the  curve 
actually  corresponds  to  an  object  boundary. 

Such  "energy-minimizing  curves"  have  two  key  advantages: 

•  The  geometric  constraints  are  directly  used  to  guide  the 
search  for  a  boundary. 

•  The edge information is integrated along the entire length
of the curve, providing a large support without including
the irrelevant information off of the curve.

*The work reported here was partially supported by the Defense
Advanced Research Projects Agency under contracts DACA76-85-C-0004 and
MDA903-86-C-0084. We wish to thank Demetri Terzopoulos for the many
discussions we have had concerning this work.

Taken  together,  these  advantages  allow  energy-minimizing 
curves  to  find  photometrically  weak  boundaries  that  local  edge 
detectors  simply  could  not  find  without  also  finding  many  irrel¬ 
evant  boundaries. 

As  originally  presented  by  Terzopoulos  and  Kass  [4,7],  energy¬ 
minimizing  curves  were  basically  designed  for  an  interactive  en¬ 
vironment.  In  this  paper,  we  extend  these  ideas  so  that  energy- 
minimizing  curves  can  be  more  easily  embedded  in  fully  auto¬ 
mated  vision  systems,  and  so  that  we  can  be  confident  that 
optimized  curves  are  good  candidates  for  object  boundaries. 

In the next section, we describe a photometric edge model
based  on  a  standard  definition  of  a  step-edge  and  show  that  the 
local  minima  of  an  energy-minimizing  curve  approximate  this 
criterion  accurately  in  more  general  conditions.  Next,  we  derive 
the  deformation  energy  and  discuss  our  implementation  for  the 
simple  case  of  smooth  curves.  We  then  describe  an  application 
of  these  smooth  curves  to  delineate  roads  and  the  use  of  more 
global  geometric  constraints  in  the  case  of  rectilinear  objects 
such  as  buildings  in  aerial  images. 

2  A  Photometric  Model:  Step-Edges 

Haralick [3], Canny [1], and Torre and Poggio [6] define step-edge
points as those for which the first directional derivative of image
intensity, in the direction of the gradient, is extremal. That is,
an edge point $(x_0, y_0)$ satisfies

$$\left. \frac{\partial^2 I(x,y)}{\partial g_0^2} \right|_{(x,y)=(x_0,y_0)} = 0, \qquad (1)$$

where

$$\frac{\partial}{\partial g_0} = \mathbf{g}_0 \cdot \nabla, \qquad \mathbf{g}_0 = \mathbf{g}(x_0,y_0) = \frac{\nabla I(x_0,y_0)}{|\nabla I(x_0,y_0)|},$$

and $I(x,y)$ represents the image intensities after convolution
with a twice-differentiable kernel (typically a Gaussian).$^1$ It can
be shown that Eq. (1) is equivalent to

$$\left. \frac{\partial |\nabla I(x,y)|}{\partial g_0} \right|_{(x,y)=(x_0,y_0)} = 0.$$

In other words, the Haralick et al. criterion for an ideal step-edge
point is equivalent to the criterion that the magnitude of the
image intensity gradient be maximal in the gradient direction.

1 This definition is a good approximation to the position of
discontinuities in the underlying intensity function when the diameter
of support of the operator is much smaller than the radius of curvature
of the edge. In particular, it is not applicable at corners and
junctions of edges.
However, for noisy non-ideal step-edges the gradient direction
may  become  very  unreliable,  especially  at  some  distance  from 
the  edge  itself.  We  therefore  use  the  following,  somewhat  more 
general,  definition  of  an  edge: 

Definition 1 An edge is a curve $C$ whose points have a gradient
magnitude that is maximal in the direction normal to the curve.
That is, all points along the curve (called edge-points) satisfy

$$\frac{\partial |\nabla I(\mathbf{f}(s))|}{\partial \mathbf{n}(\mathbf{f}(s))} = 0, \qquad (2)$$

where $s$ is the arc-length of $C$, $\mathbf{f}(s)$ is a vector function mapping
the arc-length $s$ to points $(x,y)$ in the image, and $\mathbf{n}(\mathbf{f}(s))$ is the
normal to $C$.

Note  that  this  definition  is  equivalent  to  the  Haralick  et  al.  def¬ 
inition  for  ideal  noise-free  step-edges. 

To incorporate this photometric model in the energy-minimizing
curve approach, we define the photometric energy
of the curve $C$ as

$$\mathcal{E}_P(C) = -\frac{\int |\nabla I(\mathbf{f}(s))| \, ds}{|C|},$$

where $|C|$ is the length of $C$.
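
A discrete version of this energy can be sketched for a curve represented as an open polyline. The code below is an illustration only: it assumes the gradient magnitude is available as a function `grad_mag(x, y)`, approximates the integral with a midpoint rule on each segment, and uses hypothetical names throughout. The synthetic `ridge` field stands in for the gradient magnitude of a horizontal edge along $y = 0$.

```python
import math


def photometric_energy(points, grad_mag):
    """Negative average of grad_mag along an open polyline (midpoint rule)."""
    weighted = 0.0
    length = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        segment = math.hypot(x1 - x0, y1 - y0)
        # Sample the gradient magnitude at the middle of the segment.
        weighted += grad_mag((x0 + x1) / 2, (y0 + y1) / 2) * segment
        length += segment
    return -weighted / length


def ridge(x, y):
    """Synthetic gradient magnitude, strongest along the line y = 0."""
    return math.exp(-y * y)


on_edge = photometric_energy([(0, 0), (1, 0), (2, 0)], ridge)
off_edge = photometric_energy([(0, 1), (1, 1), (2, 1)], ridge)
```

Because the energy is an average, a curve lying on the ridge has lower energy than one displaced from it, independently of how long either curve is.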

To understand why we have chosen this particular energy
function, consider the following. We prove in the Appendix
that, when $C$ is either an open or closed curve that has been
optimized, i.e., when it is a local minimum of $\mathcal{E}_P$ with respect
to infinitesimal deformations of the curve,

$$\frac{\partial |\nabla I(\mathbf{f}(s))|}{\partial \mathbf{n}(\mathbf{f}(s))} = \gamma(s) \left( |\nabla I(\mathbf{f}(s))| - \frac{\int |\nabla I(\mathbf{f}(s))| \, ds}{|C|} \right), \qquad (3)$$

where $\gamma(s)$ is the curvature of $C$. Furthermore, we prove that
when $C$ is an open curve that has been optimized,

$$|\nabla I(\mathbf{f}(0))| = |\nabla I(\mathbf{f}(|C|))| = \frac{\int |\nabla I(\mathbf{f}(s))| \, ds}{|C|} \qquad (4)$$

also holds.

These two equations have the following consequences. Eq. (3)
implies that the points along an optimized curve all are
edge-points when either the curvature is zero (i.e., it is a straight-line
segment), or the magnitude of the gradient along the curve is
constant. Consequently, optimized curves whose curvature is
small or for which the magnitude of the gradient is approximately
constant are good approximations to edges. In particular,
we have found that replacing $|\nabla I(x,y)|$ by $\log |\nabla I(x,y)|$
(which has no effect for ideal step-edges) yields more stable
solutions because the second term in Eq. (3) is smaller for
nonconstant edge strengths, and hence the approximation is
generally better.

For an open-ended curve, Eq. (4) must also be satisfied. This
equation implies that the curve is stable only when the gradient
magnitude at the end-points equals the average. In other
words, an open-ended energy-minimizing curve placed exactly
along an edge will shrink or expand (following the edge!) until
this condition is met. When the gradient magnitude is constant,
of course, this condition is always met, and curves therefore
remain unchanged. Thus, again, using $\log |\nabla I(x,y)|$ yields more
stable solutions because this equation is more nearly satisfied for
nonconstant edge strengths. An important point is that if we
had chosen to simply use the integral of gradient magnitude, as
opposed to the average, this equation would not hold and curves
would simply expand without bound!

After  optimization,  candidate  edges  can  be  scored  on  how  well 
they  match  the  photometric  model  by  computing  the  proportion 
of  points  along  the  curve  that  satisfy  Eq.  (2)  to  within  some 
tolerance.  We  use  this  score  in  our  applications  to  accept  or 
reject  any  given  edge. 

3  A  Geometric  Model:  Smooth  Curves 

3.1  Theory 

We have seen in the previous section that energy-minimizing
curves match edges well wherever they have a low curvature.
To ensure stable results, we define the deformation energy of
such curves so that their curvature remains small and the minimum
energy state is close to the photometric edge. This can be
achieved simply by defining the deformation energy of a curve
C as

\[
\mathcal{E}_D(C) = \int \gamma(s)^2 \, ds,
\]

where $\gamma(s)$ is the curvature of C. The total energy of such a curve
is then $\mathcal{E}(C) = \mathcal{E}_P(C) + \lambda \mathcal{E}_D(C)$, where $\mathcal{E}_P(C)$ is the photometric
energy defined in the previous section and $\mathcal{E}_D(C)$ the deformation
energy described above.

To  delineate  a  photometric  edge  using  one  of  these  curves, 
one  must  provide  an  initial  estimate  of  the  location  of  the  curve 
and  then  optimize  it  using  the  procedure  described  in  the  next 
subsection. The choice of $\lambda$ is directly related to the roughness
of the initial estimate as follows. Let $C_i$ be the initial curve and
$C_f$ be the curve after optimization. Let $\delta_i$ denote the operator

\[
\delta_i = \left. \frac{\delta}{\delta C} \right|_{C_i},
\]

that is, $\delta_i$ is the variational operator evaluated at the initial
curve. Define $\delta_f$ similarly for the final curve. Since, by
definition, $\delta_f \mathcal{E}(C) = 0$, then

\[
\lambda = \frac{|\delta_f \mathcal{E}_P(C)|}{|\delta_f \mathcal{E}_D(C)|}.
\]

When the initial estimate is known to be close to the final answer,
we choose

\[
\lambda = \frac{|\delta_i \mathcal{E}_P(C)|}{|\delta_i \mathcal{E}_D(C)|}
\]

so that we remain as close to the initial estimate as possible.
If instead we want to smooth the initial estimate, we can use
higher values of $\lambda$. This is in fact the simplest way of imposing
a geometric model, in this case smoothness, upon the data.


3.2  Implementation 

In the actual implementation, the curves are described as polygons
with n equidistant vertices $X = \{(x_i, y_i),\ i = 1, \ldots, n\}$ and
the deformation energy can be discretised as

\[
\mathcal{E}_D = \sum_i (2x_i - x_{i-1} - x_{i+1})^2 + (2y_i - y_{i-1} - y_{i+1})^2
\qquad (5)
\]

when $\gamma(s)$ is small. To perform the optimization we could use a
simple gradient descent technique as we have already suggested
[5], but it would be extremely slow for curves with a large number
of vertices. Instead, we embed the curve in a viscous medium
and solve the equation of dynamics

\[
\frac{\partial \mathcal{E}}{\partial X} + \alpha \frac{dX}{dt} = 0,
\]

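where $\mathcal{E} = \mathcal{E}_P + \lambda \mathcal{E}_D$ and $\alpha$ is the viscosity of the medium [7].
Because the deformation energy $\mathcal{E}_D$ in Eq. (5) is quadratic, its
derivative with respect to X is linear, and therefore

\[
\lambda \frac{\partial \mathcal{E}_D}{\partial X} = K X,
\]

where K is a pentadiagonal matrix. Thus, each iteration of the
optimization amounts to solving the linear equation

\[
K X_t + \alpha (X_t - X_{t-1}) = - \left. \frac{\partial \mathcal{E}_P}{\partial X} \right|_{X_{t-1}}.
\]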

We start with an initial step size $\delta$ and compute the viscosity so
that

\[
\alpha = \frac{\sqrt{2n}}{\delta} \left| \frac{\partial \mathcal{E}}{\partial X} \right|,
\]

where n is the number of vertices. This ensures that the initial
displacement of each vertex is on average of magnitude $\delta$.
Because of the nonlinear term, we must verify that the energy
has decreased from one iteration to the next. If, instead, the
energy has increased, the curve is reset to its previous position,
the step size is decreased, and the viscosity recomputed accordingly.
This is repeated until the step size becomes less than some
threshold value. In most cases, because of the presence of the
linear term that propagates constraints along the whole curve
in one iteration, only a very small number of iterations are
required to optimize the initial curve. For example, going from the
initial estimate of the closed curve with about 50 vertices shown
in Figure 1(a) to the optimized result shown in Figure 1(b) took
fewer than 20 iterations. Even fewer iterations were required for
the open curve.
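
One iteration of this viscous-medium scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the curve is taken to be closed, the smoothness weight `lam` multiplies the matrix explicitly, a dense Gaussian elimination stands in for a banded solver, and the photometric gradient is supplied by the caller. The names `deformation_energy`, `pentadiagonal_matrix`, `solve`, and `viscous_step` are hypothetical.

```python
def deformation_energy(xs, ys):
    """Discrete deformation energy (sum of squared second differences)
    for a closed polygon with vertex coordinates xs, ys."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Negative indexing supplies the wrap-around neighbour for i - 1.
        dx = 2 * xs[i] - xs[i - 1] - xs[(i + 1) % n]
        dy = 2 * ys[i] - ys[i - 1] - ys[(i + 1) % n]
        total += dx * dx + dy * dy
    return total


def pentadiagonal_matrix(n):
    """Cyclic pentadiagonal matrix K such that the gradient of the
    deformation energy with respect to one coordinate vector x is K x.
    Assumes a closed curve with n >= 5 vertices."""
    stencil = {-2: 2.0, -1: -8.0, 0: 12.0, 1: -8.0, 2: 2.0}
    k = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for offset, value in stencil.items():
            k[i][(i + offset) % n] += value
    return k


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= factor * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def viscous_step(coords, grad_photo, lam, alpha):
    """One semi-implicit step for a single coordinate vector:
    (lam K + alpha I) x_t = alpha x_prev - grad_photo."""
    n = len(coords)
    k = pentadiagonal_matrix(n)
    a = [[lam * k[i][j] + (alpha if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    b = [alpha * coords[i] - grad_photo[i] for i in range(n)]
    return solve(a, b)
```

The x and y coordinates are updated independently with the same matrix. With a zero photometric gradient the step can only reduce the deformation energy, which is a convenient sanity check; a practical version would exploit the banded structure of the matrix instead of a dense solve.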

4  Applications 

In this section we present two applications of our
energy-minimizing curves in an interactive context where the user
provides a rough initial estimate of the location of the contours.
The energy-minimizing curve approach has also been incorporated
in a fully automated system that extracts buildings and
roads from aerial imagery by using the boundaries of a region
segmentation to generate the initial estimates. This application
is described in detail in another paper also appearing in these
proceedings [2].

4.1  Road  Delineation 

We can directly apply the smooth curve model of the previous
section to the problem of finding road boundaries in aerial images.
The boundaries of such roads can be modeled as smooth
curves that are approximately parallel. We therefore represent
a road instance by a smooth polygonal curve forming the center
of the road, and we associate with each vertex i of this skeleton
a width $w_i$ that defines the two curves that are the candidate
road boundaries. We define the photometric energy as the sum
of the photometric energies of the two boundary curves, and
the deformation energy as the sum of the smoothness term
described above and an additional term that enforces the parallel
constraint:


We  have  applied  this  model  to  two  road-segments  in  the  aerial 
image  of  Figure  2(a).  Figure  2(b)  shows  the  initial  estimates  for 
two  road-segment  boundaries  and  Figure  2(c)  shows  the  final 
optimized  boundaries. 

This image is an excellent example of a situation where it is
difficult to use a standard local edge operator, such as the Canny
operator, to find the relevant boundaries. Because the roads are
dirt roads with ill-defined edges, no single edge-strength threshold
yields a satisfactory set of boundaries: when the threshold
is too high, most of the road edges are lost, whereas when the
threshold is too low, there is a plethora of irrelevant edges (see
Figure 3). Furthermore, the road boundaries are completely
non-existent at the intersections. This is the kind of situation in
which the predictive power of the geometric model, as captured
by the energy-minimizing curve, is crucial.

4.2  Building  Delineation 

Local geometric constraints, such as smoothness, are insufficient
when we wish to find the boundaries of more complex objects.
In such cases, we need more global constraints. For example,
to find the boundaries of rectilinear objects (whose boundary is
composed of straight line-segments forming a closed region and
intersecting at right angles), we need to add a deformation energy
term that enforces rectilinearity. To delineate such objects, we
use a set of straight edges, which are polygonal curves with only
two vertices, and the additional energy term

\[
\mathcal{E}_R = \sum_i \left( (\theta_i - \bar{\theta}) \bmod \pi/2 \right)^2,
\]

where $\theta_i$ is the angle of an edge and $\bar{\theta}$ the average direction of
all edges modulo $\pi/2$.
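
A rectilinearity term of this kind can be sketched in Python as follows. The sketch rests on two assumptions that the text does not spell out: the average direction modulo a quarter turn is computed as a circular mean of the angles multiplied by four, and the residual is wrapped symmetrically into [-pi/4, pi/4) so that edges slightly on either side of the dominant direction are penalized equally. The function names are hypothetical.

```python
import math


def mean_direction_mod_quarter_turn(angles):
    """Average edge direction modulo pi/2, computed as a circular mean
    of 4 * angle (an assumed choice of averaging)."""
    s = sum(math.sin(4.0 * a) for a in angles)
    c = sum(math.cos(4.0 * a) for a in angles)
    return math.atan2(s, c) / 4.0


def wrap_quarter_turn(angle):
    """Wrap an angle into the symmetric interval [-pi/4, pi/4)."""
    half = math.pi / 4.0
    return (angle + half) % (math.pi / 2.0) - half


def rectilinearity_energy(angles):
    """Sum of squared angular deviations from the dominant direction,
    taken modulo pi/2."""
    mean = mean_direction_mod_quarter_turn(angles)
    return sum(wrap_quarter_turn(a - mean) ** 2 for a in angles)
```

For the four edges of a rectangle rotated by any angle the energy is zero up to rounding, and it grows as an edge deviates from the common orientation.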

Figure 4(a) shows three initial sets of edges that have been
drawn by hand but could have been discovered by an edge detector.
Two of them correspond to actual buildings and one to a
tree. Figure 4(b) shows the final optimized results after imposing
the rectilinear constraint. In Figure 4(c), for each edge, only the
pixels that are edge-points according to Definition 1 are shown.
While most of the pixels belonging to building edges satisfy the
edge criterion, very few of those belonging to tree boundaries
do. Computing the proportion of edge-points in the optimized
edges thus gives us an efficient way to measure how well the
photometric boundaries match the model geometry. This criterion
is used extensively in the automated system described in [2] to
decide whether or not boundaries of a given shape match the
actual image photometry.

5  Summary  and  Conclusion 


We have presented a technique for finding object boundaries
that integrates both photometric and geometric models with an
initial estimate of the boundary. The models are incorporated
by defining an energy function for curves that is minimal when
the models are exactly satisfied. The initial estimate is used
as the starting point for finding a local minimum of the energy
function by embedding the initial curve in a viscous medium and
solving the equations of dynamics.

The strengths of this "energy-minimizing curve" approach are
that the geometric constraints are directly used to guide the
search for a boundary and that the edge information is integrated
along the entire length of the curve, thereby providing a
large support.

We have shown the precise relationship between optimized
curves and a standard definition of an edge, and discussed how
this relationship can be used to determine when an optimized
curve should be accepted as a candidate edge for further processing.
We have also presented methods for automatically choosing
the optimization parameters and how they should change over
time. We found that these contributions were crucial to our ability
to embed energy-minimizing curves in the automated system
presented in [2] and that they have provided us with the ability
to find photometrically weak boundaries that local edge detectors
simply could not find without also finding many irrelevant
boundaries.


Appendix 


In this appendix, we show that a necessary and sufficient condition
for an open curve C to be a local minimum of $\mathcal{E}(C)$
with respect to all infinitesimal deformations is that Eqs. (7) and (8)
hold, where the curve C is parametrized by its arc-length s; f(s) is a
vector function mapping the arc-length s to points (x, y) along
the curve; n(s) is the normal to the curve; t(s) is the tangent to
the curve; and $\gamma(s)$ is its curvature. Furthermore, Eq. (7) alone
is the necessary and sufficient condition for closed curves.

To prove this result, consider deformations of the curve C,
which we shall call $C_\lambda$, such that the mapping from arc-length s
to points (x, y) is of the form

\[
f_\lambda(s) = f(s) + \lambda \left( \eta(s) n(s) + \tau(s) t(s) \right),
\qquad (9)
\]

where $\eta(s)$ and $\tau(s)$ are arbitrary continuous and differentiable
functions. When C is a closed curve, $\eta(s)$ and $\tau(s)$ are constrained
to be periodic of period |C|. Since C is a local extremum of $\mathcal{E}(C)$
if and only if

\[
\left. \frac{d\mathcal{E}(C_\lambda)}{d\lambda} \right|_{\lambda = 0} = 0
\qquad (10)
\]

for all $\eta(s)$ and $\tau(s)$, we now prove the following theorem.
Theorem: Eq. (10) holds for all $\eta(s)$ and $\tau(s)$ if and only if
Eqs. (7) and (8) are satisfied.

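Proof: By definition,

where $s_\lambda$ is the arc-length of $C_\lambda$. We can rewrite this in terms
of s by using the equality

\[
ds_\lambda = |f_\lambda'(s)| \, ds,
\qquad (12)
\]

and therefore, to within the nonzero factor of $1 / (\int |f_\lambda'(s)| \, ds)^2$:

\[
\begin{aligned}
\frac{d\mathcal{E}(C_\lambda)}{d\lambda}
  &= \frac{d}{d\lambda} \left[ \int G(f_\lambda(s)) \, |f_\lambda'(s)| \, ds \right] \int |f_\lambda'(s)| \, ds \\
  &\quad - \left[ \int G(f_\lambda(s)) \, |f_\lambda'(s)| \, ds \right] \frac{d}{d\lambda} \int |f_\lambda'(s)| \, ds \\
  &= \int \left( \frac{dG(f_\lambda(s))}{d\lambda} \, |f_\lambda'(s)| + G(f_\lambda(s)) \, \frac{d|f_\lambda'(s)|}{d\lambda} \right) ds \int |f_\lambda'(s)| \, ds \\
  &\quad - \left[ \int G(f_\lambda(s)) \, |f_\lambda'(s)| \, ds \right] \int \frac{d|f_\lambda'(s)|}{d\lambda} \, ds
\end{aligned}
\qquad (13)
\]

The first term we need to evaluate in Eq. (13) is $d|f_\lambda'(s)| / d\lambda$,
for which we need the following equalities:

\[
\frac{df(s)}{ds} = f'(s) = t(s), \qquad
\frac{dt(s)}{ds} = t'(s) = \gamma(s) n(s), \qquad
\frac{dn(s)}{ds} = n'(s) = -\gamma(s) t(s).
\qquad (14)
\]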

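Therefore,

\[
\begin{aligned}
f_\lambda'(s) &= f'(s) + \lambda \left[ \eta'(s) n(s) + \eta(s) n'(s) + \tau'(s) t(s) + \tau(s) t'(s) \right] \\
  &= t(s) + \lambda \left[ \eta'(s) n(s) - \eta(s) \gamma(s) t(s) + \tau'(s) t(s) + \tau(s) \gamma(s) n(s) \right] \\
  &= t(s) \left[ 1 + \lambda \left( \tau'(s) - \eta(s) \gamma(s) \right) \right]
     + n(s) \left[ \lambda \left( \eta'(s) + \tau(s) \gamma(s) \right) \right].
\end{aligned}
\qquad (15)
\]

Since t(s) and n(s) are, by definition, orthogonal unit vectors,
we have

\[
|f_\lambda'(s)| = \sqrt{ \left[ 1 + \lambda \left( \tau'(s) - \eta(s) \gamma(s) \right) \right]^2
  + \left[ \lambda \left( \eta'(s) + \tau(s) \gamma(s) \right) \right]^2 },
\qquad (16)
\]

and therefore

\[
\frac{d|f_\lambda'(s)|}{d\lambda} =
\frac{ \left[ 1 + \lambda \left( \tau'(s) - \eta(s) \gamma(s) \right) \right] \left( \tau'(s) - \eta(s) \gamma(s) \right)
  + \lambda \left( \eta'(s) + \tau(s) \gamma(s) \right)^2 }
{ \sqrt{ \left[ 1 + \lambda \left( \tau'(s) - \eta(s) \gamma(s) \right) \right]^2
  + \left[ \lambda \left( \eta'(s) + \tau(s) \gamma(s) \right) \right]^2 } }.
\qquad (17)
\]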

The second term we need to evaluate is

Substituting Eqs. (16), (17), and (18) into Eq. (13), evaluating
at $\lambda = 0$, and multiplying by a nonzero factor:

Dividing by $\int ds$ and rearranging terms, we have

A necessary and sufficient condition for the first integral to be
zero for all $\eta(s)$ is that the term multiplying $\eta(s)$ be zero, which
is Eq. (7).

Integrating the second integral by parts:

\[
\int \tau(s) \left[ \frac{\partial G(f(s))}{\partial t(s)} - \frac{\partial G(f(s))}{\partial t(s)} \right] ds
+ \left. \tau(s) \left( G(f(s)) - \frac{\int G(f(s)) \, ds}{|C|} \right) \right|_0^{|C|}
\qquad (19)
\]

therefore,

\[
0 = \tau(|C|) \left( G(f(|C|)) - \frac{\int G(f(s)) \, ds}{|C|} \right)
  - \tau(0) \left( G(f(0)) - \frac{\int G(f(s)) \, ds}{|C|} \right).
\qquad (20)
\]

A necessary and sufficient condition for this equality to hold for
all $\tau(s)$ is that both terms be zero, which is Eq. (8).

Thus, a necessary and sufficient condition for an open curve C
to be a local extremum of $\mathcal{E}(C)$ is that both Eq. (7) and Eq. (8) hold.

Furthermore, for closed curves, $\tau(0) = \tau(|C|)$ and $G(f(0)) =
G(f(|C|))$; hence Eq. (20) is always satisfied. Therefore, a necessary
and sufficient condition for a closed curve C to be a local
extremum of $\mathcal{E}(C)$ is that Eq. (7) holds.

Q.E.D.


Figure 1: (a) Initial estimates for both an open and a closed
boundary.

(b) Final outlines after optimization.

Figure 2: (a) An aerial image showing a network of dirt roads.

(b) Two initial estimates of the road outlines.

(c) Final delineations after optimization.


Figure 3: Edge images generated by the Canny edge detector
run on the dirt road image with two different
sets of edge strength thresholds. Note that when
the thresholds are too high, as in (a), most edges
are lost, while many irrelevant edges are found when
the thresholds are dropped, as in (b).

Figure 4: (a) Three sets of edges overlaid on an aerial image.

(b) The edges after optimization using a rectilinearity
constraint. (c) The pixels in all edges that satisfy
our definition of an edge pixel. Note that while most
pixels in the edges that are building edges satisfy our
criterion, almost none of those that belong to tree
edges do.

Abstract


The previous implementations of our Epipolar-Plane Image Analysis
mapping technique demonstrated the feasibility and benefits
of the approach, but were carried out for restricted camera
geometries. The question of more general geometries made utility
for autonomous navigation uncertain. We have developed a
generalization of the analysis that a) enables varying view direction,
including varying over time, b) provides three-dimensional
connectivity information for building coherent spatial descriptions
of observed objects, and c) operates sequentially, allowing
initiation and refinement of scene feature estimates while the sensor
is in motion. To implement this generalization it was necessary
to develop an explicit description of the evolution of images over
time. We have achieved this by building a process that creates
a set of two-dimensional manifolds defined at the zeroes of
a three-dimensional spatiotemporal Laplacian. These manifolds
represent explicitly both the spatial and temporal structure of
the temporally-evolving imagery, and we term them spatiotemporal
surfaces. The surfaces are constructed incrementally, as
the images are acquired. We describe a tracking mechanism that
operates locally on these evolving surfaces in carrying out
three-dimensional scene reconstruction.


1. INTRODUCTION

1.1. Epipolar-Plane Image Analysis

In [2], we described a sequence analysis technique we developed
for use in obtaining depth estimates for points in a static scene.
The approach bridged the usual dichotomy of depth sensing in
that its large number of images led to a large baseline and thus
high accuracy, while rapid image sampling gave minimal change
from frame to frame, eliminating the correspondence problem.
Rather than choosing quite disparate views and putting features
into correspondence by stereo matching, with this technique we
chose to process massive amounts of similar data, but with much
simpler and more robust techniques. The technique capitalized
on several constraints we could impose on the image acquisition
process, namely:

1. the camera moved along a linear path;

2. it acquired images at equal spacing as it moved;

3. the camera's view was orthogonal to its direction of travel.

With these constraints, we could guarantee that

1. individual scene features would be observed in single epipolar
planes over the period of scanning;

2. images of these planes could be constructed by collecting
corresponding image scanlines in successive frames;

3. the motion of scene features in these images would appear
as linear tracks.

We termed these image planes epipolar-plane images, or EPIs,
and the process Epipolar-Plane Image Analysis.

1.2. Problems with the Previous Approach

In the earlier publication we commented on our principal
dissatisfactions with the approach, and the limitations which would
restrict its usefulness. Summarizing, these were:

Limitations:

L1 Orthogonal viewing would preclude many of the camera
attitudes one would expect to be necessary for an autonomous
vehicle, notably that attitude in which the vehicle is
looking along its direction of motion, or when it is to track
some particular feature and follow it as it moves across the

L2 A constant rate of image acquisition would be difficult to
guarantee, and probably not be desirable in a general context.
Sampling rates will be affected heavily by computational
demands on the system and vehicle velocities dictated
by higher level concerns.

L3 A linear path would be an unacceptable or highly improbable
trajectory in most every situation except extended flight.

L4 Static scenes are the least likely: winds blow, clouds move;
often a moving object in a scene is the one of most interest.


Dissatisfactions:

D1 The analysis should proceed sequentially as the imagery is
acquired. To insist that all data be available before scene
measurement can be begun would eliminate one of the principal
goals of the process: to provide timely spatial information
for a vehicle in motion.

D2 The EPI partitioning, through its selection of the temporal
over the spatial analysis of images, could not provide
spatially coherent results. It produced point sets. We
attempted clustering operations on these, but were never
satisfied with such a post hoc approach. The proper approach to
obtaining spatial coherence in our results would begin with
not losing it in the first place.

2. NEW APPROACH TO EPI ANALYSIS


2.1.  Generalizations 

We have developed generalizations to our earlier approach that
enable us to resolve L1, L2 and D1 and D2. Arbitrary and varying
camera attitudes and velocities are permissible in our new
formulation, and we process the data sequentially as acquired,
forming estimates, of increasing precision, descriptive of spatial
contours rather than points. The generalization also suggests a
mechanism for dealing with the non-linear path question of L3,
although we have not pursued this as yet.

L4 arises as an incompatibility between our performance desires
for a vision system and our definition of the task we choose to
address. We wish to build 3-dimensional descriptions of scenes,
and it is inappropriate to expect this to be possible if our view of
the scene is undergoing change unrelated to our active pursuit of
observations. In [2] we discussed this motion issue, and suggested
how to recognize its presence in a scene. Once distinguished from
the static elements, it would be possible to invoke higher-order
models and filters to estimate these objects' dynamics (see Broida
[4] and Gennery [8]), but our current interest is in modelling
static structure, and we will not pursue this issue here.

In common with our earlier work, our new approach involves the
processing of a very large number of images acquired by a moving
camera. The analysis is based on three constraints:

1. the camera's movement is restricted to lie along a linear
path;

2. the camera's position and attitude at each imaging site are
known;

3. image capture is rapid enough with respect to camera movement
and scene scale to ensure that the data is, in general,
temporally continuous.

Within this framework, we generalize from the traditional notion
of epipolar lines to that of epipolar planes - a set of epipolar lines
sharing the property of transitivity. We formulate a tracking
process which exploits this property for determining the position
of features in the scene. This tracking occurs on what we term
the spatiotemporal surface - a surface defining the evolution of a
set of scene features over time. Critical to visualizing this
space-time approach is obtaining an understanding of the geometry of
the sensing situation, and this is described in the next section.

2.2. Geometric Considerations of Camera Path and Attitude
The camera is modelled as a pin-hole with image plane in front
of the lens (Fig. 1). For each feature P in the scene and two
viewing positions $V_1$ and $V_2$ there is an epipolar plane, which
passes through P and the line joining the two lens centers. This
plane intersects the two image planes along corresponding epipolar
lines (note that, here, intersection and projection are, in a
sense, equivalent). An epipole is the intersection of an image
plane with the line joining the lens centers. In motion analysis,
an epipole is often referred to as the focus of expansion (FOE)
because the epipolar lines radiate from it. The camera moves in a
straight line, and the lens centers at the various viewing positions
lie along this line. Here, the FOE is the camera path. This
structuring divides the scene into a pencil of planes passing through
the camera path. We view this as a cylindrical coordinate system
with axis the camera path, angle defined by the epipolar plane,
and radius the distance of the feature from the axis. Note that
a scene feature is restricted to a single epipolar plane, and any
scene features at the same angle (within the discretization) share
that epipolar plane. This means that, as in our earlier work, the
analysis of a scene can be partitioned into a set of analyses, one
for each plane, and these planes can be processed independently.
In section 3 we describe how we organize the data to exploit this
constraint.


Fig. 2 shows a simple motion with a camera travelling orthogonal
to its direction of view. This corresponds to the situation
depicted by V2 of Fig. 1. Here, the epipolar lines for a feature
such as P are horizontal scanlines, and these occur at the same
vertical position (scanline) in all the images. This is the camera
geometry normally chosen for computer stereo vision work. Each
scanline is a projected observation of the features in an epipolar
plane. The projection of P onto these epipolar lines moves to the
right as the camera moves to the left. If one were to take a single
epipolar line (scanline) from each of a series of images obtained
with this camera geometry and compose a spatiotemporal image
(with horizontal being spatial and vertical being temporal), one
would see a pattern as in Fig. 3. For this type of motion feature
trajectories are straight lines, as can be seen. This is the case
handled by our previous analysis. If, on the other hand, the camera
were moving with an attitude as shown at V3 in Fig. 1, the
set of epipolar lines would form a pattern as shown in Fig. 4. For
this type of motion feature trajectories are hyperbolas. Notice
that the epipolar lines are no longer scanlines; they are oriented
radially and pass through the FOE. Allowing the camera to vary
its attitude along the path gives rise to spatiotemporal images as
shown in Fig. 5. Here, the epipolar lines form no fixed pattern,
and the paths of features are neither linear nor hyperbolic - in
fact they are arbitrary curves.
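
The difference between the linear and hyperbolic cases can be checked numerically with a small Python sketch of a pin-hole camera moving within a single epipolar plane. The coordinate conventions, the `yaw` parameter for a fixed non-orthogonal attitude, and the depth-from-slope relation are assumptions of this illustration (they follow from the pin-hole model rather than from anything stated above), and the function names are hypothetical.

```python
import math


def project(feature, cam_x, yaw=0.0, focal=1.0):
    """Image coordinate of a feature (x, z) seen from a camera at
    (cam_x, 0) travelling along the x axis. yaw = 0 means the view
    direction is orthogonal to the direction of travel."""
    dx = feature[0] - cam_x
    z = feature[1]
    xc = dx * math.cos(yaw) - z * math.sin(yaw)
    zc = dx * math.sin(yaw) + z * math.cos(yaw)
    return focal * xc / zc


def track(feature, positions, yaw=0.0, focal=1.0):
    """Feature trajectory: one image coordinate per camera position."""
    return [project(feature, c, yaw, focal) for c in positions]


def second_differences(values):
    """Discrete second derivative; zero everywhere for a linear track."""
    return [values[i - 1] - 2.0 * values[i] + values[i + 1]
            for i in range(1, len(values) - 1)]


def depth_from_linear_track(u, spacing, focal=1.0):
    """Depth of a feature from the slope of its linear track, assuming
    orthogonal viewing and equally spaced camera positions."""
    slope = (u[-1] - u[0]) / (len(u) - 1)
    return -focal * spacing / slope
```

With equally spaced positions and `yaw=0.0` the second differences of the track vanish up to rounding, while a fixed non-zero `yaw` gives a curved track, which is the hyperbolic case described above.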


Fig. 3. Orthogonal Viewing.


Fig. 4. Fixed, Non-orthogonal Viewing.

Fig. 5. View Direction Varying.

2.3. Keeping the Problem Linear

At the outset, our goal was to determine the position of observed
features, and we do this by tracking their appearance in these
epipolar planes. Obviously in the case of orthogonal viewing
(e.g., as in Fig. 3 and at V2 in Fig. 1), the tracking is linear.
For general camera attitudes, including varying, it is non-linear.
Computational considerations make it extremely advantageous
for the tracking to be posed as a linear problem. To maintain the
linearity regardless of viewing direction we find not linear feature
paths in the EPIs (Figures 3 through 5), but linear paths in a
dual space. Our insight here (see Marimont [13]) is that no matter
where a camera roams about a scene, for any particular feature,
the lines of sight from the camera principal point through that
feature in space, determined by the line from the principal point
through the point in the image plane where the projected feature
is observed, all intersect at the feature (modulo the measurement
error). The duals of these lines of sight lie along a line whose dual
is the scene point (see Fig. 6): fitting a point to the lines of sight is
a linear problem. This, then, gives us a metric for linear tracking
of features: we map feature image coordinates to lines of sight,
and use an optimal estimator to determine the point minimizing
the variance from those lines of sight (we currently model only
uncertainties in the image-plane observations, and not those in
the sensor position).
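
To see why fitting a point to a set of lines of sight is linear, consider the following Python sketch, which works within a single epipolar plane. It substitutes a plain unweighted least-squares fit for the optimal estimator mentioned above and does not use the dual-space representation; the function name and argument layout are hypothetical. Each line of sight contributes one equation that is linear in the unknown point, so the estimate comes from a 2x2 system.

```python
import math


def fit_point_to_lines(origins, directions):
    """Least-squares point closest to a set of 2-D lines of sight.

    Each line passes through origins[i] (a camera principal point) with
    direction directions[i]. A point p lies on the line when
    n . p = n . origin, where n is the unit normal of the line, and this
    condition is linear in p."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), (dx, dy) in zip(origins, directions):
        norm = math.hypot(dx, dy)
        nx, ny = -dy / norm, dx / norm   # unit normal to the line
        c = nx * px + ny * py            # signed offset of the line
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * c
        b2 += ny * c
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("lines of sight are parallel; no unique point")
    x = (a22 * b1 - a12 * b2) / det
    y = (a11 * b2 - a12 * b1) / det
    return x, y
```

A weighted version would scale each line's contribution by the inverse variance of its image-plane observation, which is closer in spirit to the estimator described in the text.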


Fig. 6. Line of Sight Duality (panels: dual space, scene space).

We must, however, have a mechanism for extracting the observations
of features from the individual images in which they occur,
and grouping them by epipolar plane. Only in the simple case of
viewing angle orthogonal to the motion is this grouping trivial
(Fig. 3), and this was the case our earlier work addressed. To
obtain this structuring in the general cases, we could take one
of two approaches. The first is to transform the images from
the Cartesian space in which they are sampled to an epipolar
representation (see Baker [1] and Jain [12]). Because of aliasing
effects and non-linearities in the mapping, we prefer to avoid
this. Probably the best solution would be to use a sensor which
delivers the data directly in this form (a spherical sensor having
meridian scanning would accomplish this (see Gibson [9])). Such
a sensor not yet being available, we choose the second approach,
transforming the features we detect in image space to the desired
epipolar space, the cylindrical coordinate system of Fig. 1.
The structure we have developed for implementing this brings us
several other advantages, as the next section describes.

3.  THE  SPATIOTEMPORAL  SURFACE 

3.1.  Structuring  the  Data  -  Spatiotemporal  Connectivity 
We  collect  the  data  as  a  sequence  of  images,  in  fact  stacking 
them  up  as  they  are  acquired  into  a  spatiotemporal  volume  as 
shown  in  Fig.  7.  As  each  new  image  is  obtained,  we  construct 
its  spatial  and  temporal  edge  contours.  These  contours  are  three- 
dimensional  zeros  of  the  Laplacian  of  a  chosen  three-dimensional 
Gaussian  (see  also  Buxton  [5]  and  lleeger  [11]  for  their  use  of 
spatiotemporal  convolution  over  an  image  sequence),  and  the 
construction  produces  a  spatiotemporal  surface  enveloping  the 
signed  volumes  (note  that  in  two  dimensions  edge  contours  en¬ 
velop  signed  regions).  The  spatial  connectivity  in  this  structure 
lets  us  explicitly  maintain  object  coherence  between  features  ob¬ 
served  on  separate  epipolar  planes;  the  temporal  connectivity 
gives us, as before, the tracking of features over time. See [3]
for  a  description  of  how  these  surfaces  are  constructed. 

Fig. 7 Spatiotemporal Volume.

The  need  for  maintaining  this  spatial  connectivity  can  be  ob¬ 
served  by  viewing  our  earlier  results  in  [2],  one  set  of  which  is 
shown  in  Fig.  8.  There,  in  processing  the  EPIs  independently,  we 
obtained  separate  planes  of  results.  Wishing  to  exploit  the  fact 
that  there  should  be  some  spatial  coherence  between  these  sets  of 
points,  we  used  proximity  of  the  resulting  estimates  on  adjacent 
planes  to  filter  outliers.  Features  not  within  the  error  (covari¬ 
ance)  ellipses  of  those  above  or  below  them  were  discarded.  The 
remaining results (Fig. 8) were sparse and fragmented; however,
the problem did not lie with this filtering, but with the loss of
spatial  connectivity  in  the  first  place.  Our  separation  of  the  data 
into  EPIs  and  then  subsequent  independent  processing  of  these 
lost  the  spatial  connectivity  apparent  in  the  original  images.  We 
maintained  instead  the  temporal  connectivity  that  was  critical  to 
the feature tracking. For spatial connectivity in the scene reconstruction,
spatial connectivity in the imagery must be preserved.
The  next  three  figures  present  a  simplified  example  of  this  spatial 
and  temporal  connectivity.  Fig.  9  shows  a  sequence  of  simulated 
images  depicting  a  camera  zooming  in  on  a  set  of  rectangles. 
Fig. 10 shows the spatiotemporal surfaces in a crossed-eye stereo
wire-frame  representation,  and  these  are  shown  again  in  rendered 
form  in  Fig.  11.  The  spatial  and  temporal  interpretation  of  these 
surfaces  should  be  quite  apparent. 


Fig.  9.  Simulation:  Linear  Path,  Motion  toward  Center. 


Fig. 10. Spatiotemporal Surfaces from Fig. 9 Imagery.


Fig.  11.  Surfaces  of  Fig.  10  Rendered  for  Display. 


Fig. 8. Orthogonally-Viewed Scene: Results (displayed for crossed-eye viewing).

Fig. 12. Sequence 1st and 128th Images.

Fig. 13. 1st and 128th Images at 1/8 Resolution.


Fig.  14.  Spatiotemporal  Surface  Representation,  First  10  Frames. 


Fig.  15.  Epipolar-Plane  Surface  Representation. 


In  spatiotemporal  surface  descriptions,  feature  observations  bear 
(u,  v,  t)  coordinates,  and  are  spatiotemporal  voxel  facets.  Fig.  14 
shows  a  mesh  description  of  the  facets  for  the  spatiotemporal 
surfaces  associated  with  the  forward-viewing  sequence  whose  first 
and  last  images  are  depicted  in  Fig.  12.  These  images  are  much 
more  complex  than  those  of  Fig.  9.  In  the  interest  of  clarity,  the 
surface  representations  we  will  show  in  the  remaining  figures  are 
based  on  a  simplified  version  of  this  imagery  one  eighth  the 
linear  resolution  of  the  originals.  Fig.  13  shows  these  two  frames 
at  the  reduced  resolution. 

3.2.  Structuring  the  Data  -  Epipolar-Plane  Representation 
As mentioned in the previous section, for non-orthogonal viewing
directions, epipolar lines are not distinguished by the spatial
v scanline coordinate. To obtain this necessary structuring we
develop within this spatiotemporal description an embedded
representation that makes the epipolar organization explicit. Over
each of the sequential images, we transform the (u, v, t)
coordinates of our spatiotemporal zeros to (r, h, θ) cylindrical
coordinates (θ indicates the epipolar-plane angle, θ ∈ [0, 2π]; the
quantized resolution in θ is a supplied parameter, and the transform
for each image is determined by the particular camera
parameters). In this new coordinate system, we build a structure similar
to our earlier EPI edge contours, but dynamically organized by
epipolar plane. This is done by intersecting the spatiotemporal
surfaces with the pencil of appropriate epipolar planes¹ (as in
Fig. 1). We weave the epipolar connectivity through the
spatiotemporal volume, following the known camera viewing direction
changes. Fig. 15 shows a sampling of the spatiotemporal
surfaces as they intersect the pencil of epipolar planes (every fifth
plane is depicted). You will notice the obvious radial flow pattern
away from the epipole (FOE). Fig. 16 shows seven of these
surface/plane intersections, along with the associated bounding
planes (refer to Fig. 1). The edge that all share is the camera
path (the epipole). Fig. 17 isolates a single surface from the top
left of Fig. 14, and shows its spatiotemporal structure. Fig. 18
shows the same surface structured by its epipolar-plane
components. Baker [3] gives details of this intersection operation on the
spatiotemporal surface.
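
The quantization of the epipolar-plane angle can be illustrated with a small sketch. It is not the transform used in the system, which depends on the full camera parameters for each image. It assumes a simplified case in which the epipole (FOE) falls at a known image position, so that each epipolar plane projects to a ray leaving the epipole; the function name `epipolar_plane_index` and its arguments are hypothetical.

```python
import math


def epipolar_plane_index(u, v, epipole, n_planes):
    """Return the quantized epipolar-plane index of image point (u, v).

    Assumption: the epipole is at image position `epipole`, so the
    plane angle theta is the direction of the ray from the epipole to
    the point. `n_planes` is the supplied resolution in theta.
    """
    du, dv = u - epipole[0], v - epipole[1]
    if du == 0 and dv == 0:
        raise ValueError("the epipole lies on every epipolar plane")
    theta = math.atan2(dv, du) % (2.0 * math.pi)   # angle in [0, 2*pi)
    # The final modulo guards against theta rounding up to 2*pi.
    return int(theta / (2.0 * math.pi) * n_planes) % n_planes
```

Features that share an index would then be grouped on the same (quantized) epipolar plane before tracking.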

¹ Notice that, with view direction varying, the underlying epipolar plane
may be far from planar in (u, v, t) space; it may undulate in a manner
similar to that in which Fig. 5 varies from Fig. 4, and for similar reasons.

Fig. 16. Intersection of 7 Epipolar Planes with Spatiotemporal Surfaces (30 frames).


3.3.  Feature  Tracking  and  Estimation 

In  Fig.  19  we  show  the  tracking  of  scene  features  on  the  spa¬ 
tiotemporal  surfaces  in  the  vicinity  of  this  surface,  with  the  final 
pair  showing,  in  crossed-eye  stereo  form,  the  result  of  the  tracking 
after  10  frames.  The  coding  is  as  follows:  initiation  of  a  feature 
tracking is marked by a circle; the leading observation of a feature
(active  front)  is  shown  as  an  X;  lines  join  feature  observations; 
5  observations  (an  arbitrary  number,  2  may  be  sufficient)  must 
be  acquired  before  an  estimate  is  made  of  the  feature’s  position 
-  at  that  point  an  initial  batch  estimate  is  made,  and  a  Kalman 
filter  (see  Gelb  [7],  Mikhail  [14])  is  turned  on  and  associated 
with  the  feature  -  this  initiation  of  a  Kalman  filter  is  coded  by 
a  square;  where  two  observations  merge,  the  tracking  is  stopped 
and  the  features  are  entered  into  the  database  -  this  is  coded  by 
a diamond².

As mentioned earlier, observations are expressed as line-of-sight
vectors, and these are represented in the epipolar plane by the
homogeneous line equation $ax + by - c = 0$. For the initial batch
estimation, the coordinates ($X$) of the feature are the solution
of the normal equations for the weighted least squares system:
$X = (H^T W H)^{-1} H^T W C$, where $H$ is the $m \times 2$ matrix of
$(a_i, b_i)$ observations, $C$ is the vector of $c_i$, and $W$ is the
diagonal matrix of observation weights, determined by the distance
from the camera to the observed feature at observation position $i$.
We estimate $X$ first without weights, then compute the weighted
solution and the desired covariance matrix, $V$. Given a current
estimate $X_{i-1}$

² Dickmanns [6] and Hallam [10] describe vehicle navigation controllers that
similarly work sequentially, utilizing Kalman and other filters for
estimating motion parameters.


and covariance $V_{i-1}$, the Kalman filter at observation $i$ updates
these as:

$$K_i = V_{i-1} H_i^T / [H_i V_{i-1} H_i^T + w_i]$$

$$V_i = [I - K_i H_i] V_{i-1}$$

$$X_i = X_{i-1} + K_i [c_i - H_i X_{i-1}]$$

$K_i$ is the 2x1 Kalman gain matrix, $w_i$ is the observation weight,
a scalar, based on the distance from the camera at observation
position $i$ to $X_{i-1}$.
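
The batch estimate and the sequential update can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation described here: each observation is a triple (a, b, c) for the line ax + by = c, the function names are hypothetical, and the scalar `w` simply takes the place of $w_i$ in the denominator of the gain. The matrix returned by `batch_estimate` is $(H^T W H)^{-1}$, which can be read as a covariance only if the weights are inverse variances.

```python
def batch_estimate(lines, weights=None):
    """Weighted least-squares point from lines a*x + b*y = c.

    Solves the 2x2 normal equations (H^T W H) X = H^T W C and
    returns the estimate X together with (H^T W H)^-1.
    """
    if weights is None:
        weights = [1.0] * len(lines)
    s_aa = s_ab = s_bb = s_ac = s_bc = 0.0
    for (a, b, c), w in zip(lines, weights):
        s_aa += w * a * a
        s_ab += w * a * b
        s_bb += w * b * b
        s_ac += w * a * c
        s_bc += w * b * c
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("lines of sight are (nearly) parallel")
    inv = [[s_bb / det, -s_ab / det], [-s_ab / det, s_aa / det]]
    x = [inv[0][0] * s_ac + inv[0][1] * s_bc,
         inv[1][0] * s_ac + inv[1][1] * s_bc]
    return x, inv


def kalman_update(x, v, line, w):
    """One sequential update with H = (a, b), measurement c, scalar w."""
    a, b, c = line
    vh = [v[0][0] * a + v[0][1] * b, v[1][0] * a + v[1][1] * b]  # V H^T
    s = a * vh[0] + b * vh[1] + w                                # H V H^T + w
    k = [vh[0] / s, vh[1] / s]                                   # gain K
    resid = c - (a * x[0] + b * x[1])                            # c - H X
    x_new = [x[0] + k[0] * resid, x[1] + k[1] * resid]
    hv = [a * v[0][0] + b * v[1][0], a * v[0][1] + b * v[1][1]]  # H V
    # V_new = (I - K H) V
    v_new = [[v[0][0] - k[0] * hv[0], v[0][1] - k[0] * hv[1]],
             [v[1][0] - k[1] * hv[0], v[1][1] - k[1] * hv[1]]]
    return x_new, v_new
```

For example, the three lines x = 2, y = 3 and x + y = 5 give a batch estimate of (2, 3); a further observation that also passes through that point leaves the estimate unchanged and shrinks the matrix V.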

The  tracking  of  an  individual  feature  is  depicted  in  Fig.  20.  The 
camera  path  runs  across  the  figure  from  the  lower  left.  Lines  of 
sight  are  shown  from  the  camera  path  through  the  observation 
of  the  feature  at  the  upper  right.  As  the  Kalman  filter  is  begun 
(T4),  an  estimate  (marked  by  an  x)  and  confidence  interval  (the 
ellipse)  are  produced.  As  further  observations  are  acquired,  the 
estimate  and  confidence  interval  are  refined.  Tracking  continues 
until  either  the  feature  is  lost,  or  the  error  term  begins  to  increase 
(suggesting  that  observations  not  related  to  the  tracked  feature 
are beginning to be included³). Note that although a single feature
is depicted here, it is part of a spatiotemporal surface. This
means that we have explicit knowledge of those other features
to which this is spatially adjacent. Fig. 21 shows a contour - a
connected set of features on such a surface - observed over time
as its shape evolves. Such contours are being constructed and
refined  over  the  entire  image  as  the  analysis  progresses.  Our  cur¬ 
rent  representation  of  scene  structure  is  based  on  these  evolving 
contours. 

³ or the zero-crossing is erroneous, or the feature is not stationary, or the
feature is a contour rather than a single point in space, or ...

3.4.  Generality from the Spatiotemporal Surface
A  crucial  constraint  of  the  current  epipolar-plane  image  analysis 
is that having a camera moving along a linear path enables us
to  divide  the  analysis  into  planes,  in  fact  the  pencil  of  planes  of 
Fig.  1  passing  through  the  camera  path.  With  this,  we  are  as¬ 
sured  that  a  feature  will  be  viewed  in  just  a  single  one  of  these 
planes,  and  its  motion  over  time  will  be  confined  to  that  plane. 
Another  crucial  constraint  is  the  one  we  generalized  from  the 
orthogonal viewing case: we know that the set of line-of-sight
vectors  from  camera  to  feature  over  time  will  all  intersect  at  that 
feature,  and  determining  that  feature's  position  is  a  linear  prob¬ 
lem.  This  latter  constraint  does  not  depend  upon  the  linear-path 
constraint.  In  fact,  the  problem  would  remain  linear  even  if  the 
camera  meandered  in  three  dimensions  all  over  the  scene.  This 
knowledge  gives  us  a  possibility  of  removing  the  restriction  that 
the  camera  path  be  linear.  All  that  the  linear  path  guarantees  is 
that  the  problem  is  divisible  into  epipolar  planes.  If  we  lose  this 
constraint, then we cannot restrict our feature tracking to separate
planes. The features will, however, still form linear paths
in the space of line-of-sight vectors, and our spatiotemporal
surface description is an appropriate representation for doing this
non-planar, but still linear, tracking. The motion of features
will give us ruled surfaces in this space of vectors, with the rules
(zeros of Gaussian curvature) revealing the positions of the
features in the scene (visualize pick-up-sticks jammed in a box, with
the sticks being the rules). This generality suggests that there
is even broader application for the technique than we had
initially thought. Of course, it would be possible here to use the
pairwise epipolar constraints between images to constrain rule
tracking on the spatiotemporal surface. The constraint would
only apply pairwise, as the images will not have the transitivity
property cited earlier. It is also worth noticing that, when the
camera attitude and position parameters are not provided, the
spatiotemporal surface contains everything that is necessary for
determining them; but this is another problem, one which we
hope to look at in the near future.
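
The claim that locating a feature stays a linear problem for an arbitrary camera path can be illustrated with a standard least-squares intersection of three-dimensional lines. This is a sketch under stated assumptions, not the authors' estimator: each line of sight is given by a camera position and a direction, all observations are weighted equally, and the function names are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [val] for row, val in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("lines of sight do not fix a unique point")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - tail) / m[r][r]
    return x


def intersect_lines_of_sight(origins, directions):
    """Least-squares point closest to a set of 3-D lines.

    Solves (sum P_i) X = sum P_i p_i, where p_i is a camera position
    and P_i = I - d_i d_i^T projects onto the plane perpendicular to
    the unit line-of-sight direction d_i.
    """
    a = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for p, d in zip(origins, directions):
        norm = sum(c * c for c in d) ** 0.5
        d = [c / norm for c in d]
        for r in range(3):
            for c in range(3):
                proj = (1.0 if r == c else 0.0) - d[r] * d[c]
                a[r][c] += proj
                rhs[r] += proj * p[c]
    return solve3(a, rhs)
```

The camera positions need not lie on a line; the system is solvable as soon as two of the lines of sight are not parallel.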

4.  CONCLUSIONS


We  showed,  in  our  earlier  work,  the  feasibility  of  extracting  scene 
depth information through Epipolar-Plane Image Analysis. Our
theory  applies  for  any  motion  where  the  lens  center  moves  in  a 
straight  line,  with  the  earlier  implementation  covering  the  special 
case  of  camera  sites  equidistant  and  viewing  direction  orthogo¬ 
nal  to  the  camera  path.  The  generalizations  obtained  through 
spatiotemporal  surface  analysis  bring  us  the  advantages  of: 

•  incremental  analysis; 

•  unrestricted  viewing  direction  (including  direction  varying 
along  the  path); 

•  spatial  coherence  in  our  results,  providing  connected  surface 
information  for  scene  objects,  rather  than  point  estimates 
structured  by  epipolar  plane; 

•  the  possibility  of  removing  the  restriction  that  fixes  us  to  a 
linear  path. 

The current implementation, running on a Symbolics 3000, processes
the spatiotemporal surfaces at a 1KHz voxel rate, with
the  associated  intersecting,  tracking,  and  estimation  procedures 
bringing  this  down  to  about  150Hz,  75%  of  which  is  consumed 
in  the  surface  intersection  (the  surface  intersection  would  not  be 
required if we had a sensor of the appropriate geometry). Both
the  feature  tracking  and  the  surface  construction  computations 
are  well  suited  to  MIMD  (or  perhaps  SIMD)  parallel  implemen¬ 
tation.  With  these  considerations,  and  the  process’s  inherent 
precision  and  robustness,  spatiotemporal-surface-based  epipolar- 
plane  image  analysis  shows  great  promise  for  tasks  in  realtime 
autonomous  navigation  and  mapping. 
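
The effect of removing the surface intersection can be estimated with a little arithmetic. The sketch below assumes that the 75% figure is the fraction of per-voxel processing time spent in the intersection at the overall 150Hz rate; that reading, and the function name, are assumptions made for illustration.

```python
def rate_without_intersection(total_rate_hz, intersection_fraction):
    """Voxel rate if the surface-intersection time were removed."""
    time_per_voxel = 1.0 / total_rate_hz
    remaining = time_per_voxel * (1.0 - intersection_fraction)
    return 1.0 / remaining


# About 6.7 ms per voxel overall, of which 5 ms would be intersection.
estimate_hz = rate_without_intersection(150.0, 0.75)
```

Under that assumption, a sensor delivering epipolar-plane data directly would leave roughly 1.7 ms per voxel, or a rate near 600Hz.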

Abstract

We  describe  a  three-dimensional  surface  construction  process 
designed for the analysis of image sequences. Named the Weaving
Wall, the process operates over images as they arrive from
the  sensor,  knitting  together,  along  a  parallel  frontier,  con¬ 
nected  descriptions  of  images  as  they  evolve  over  time.  De¬ 
veloped  to  support  a  tracking  mechanism  for  recovering  the 
three-dimensional  structure  of  a  scene  being  traversed,  other 
applications  have  since  become  apparent.  These  include  ren¬ 
dering  and  computation  on  tomographic  medical  data,  display 
of  higher-dimensioned  analytic  functions,  edge  detection  on  the 
scale-space  surface,  and  display  and  analysis  of  material  frac¬ 
ture  data.  We  present  here  displays  from  some  of  these  appli¬ 
cations.  The  Weaving  Wall  may  be  of  use  in  representing  the 
evolution  of  any  two-dimensional  imagery  varying  in  a  nearly 
continuous  manner  along  a  third  dimension.  We  are  currently 
looking into extending the processing to four dimensions.

1.  SURFACES  OF  EVOLUTION 

1.1.  Background to the Development

Images  of  real  objects  have  an  inherent  spatial  coherence  that 
in general is quite obvious. This coherence persists as the objects
are smoothly moved about or viewed from varying perspectives.
Indeed it is often the case that clarity in the perception of
a  scene's  structure  is  enhanced  through  either  object  or  viewer 
motion.  Here,  the  temporal  coherence  in  the  imagery  reinforces 
that  observed  spatially.  The  research  described  in  this  paper 
developed  from  our  efforts  in  generalizing  Epipolar-Plane  Image 
Analysis  [3]  to  cases  where  both  the  camera  and  the  objects  in 
the  scene  are  free  to  move  about.  Although  we  have  not  attained 
all  of  this  yet,  the  process  we  describe  here  is  supporting  these 
developments. 

Coupled with these viewing generalizations was a desire to enable
sequential  processing  of  the  image  data,  all  the  while  maintaining 
an  evolving  description  of  the  scene.  Two  items  in  particular  in 
this  effort  complicated  the  analysis  and  necessitated  fairly  radi¬ 
cal  redesign  from  our  earlier  system:  a  desire  to  handle  sequences 
with  varying  camera  attitude,  and  the  determination  to  produce 
spatially  coherent  results.  In  our  earlier  work,  we  retained  the 
temporal  continuity  from  sequences  of  images  and  used  it  quite 
effectively in tracking and localizing features in the scene, but,
in so doing, discarded the spatial continuity apparent in the
individual images. Our broader goals for this research made us
realize that this was information we must not lose (see [2]): our
computations  will  depend  on  both  temporal  and  spatial  image 
structure.  This  meant  that  we  needed  to  develop  a  mechanism 
for building an explicit description of the spatial and temporal
evolution  of  images  over  time. 

1.2.  What  Does  It  Mean  For  An  Image  To  Evolve? 

If  objects  move  or  a  camera  moves  about  a  scene,  the  projections 
of  scene  features  will  move  about  in  the  image.  If  the  motions 
are  sufficiently  fine,  there  will  be  no  difficulty  in  following  the 
various  components  of  the  scene  in  their  independent  travels. 
Consider,  as  we  do  for  the  rest  of  this  discussion,  that  it  is  the 
camera  which  is  undergoing  motion  while  viewing  a  static  scene. 
If  the  camera  slides  left  at  right  angles  to  its  direction  of  view,  we 
would  expect  to  see  the  scene  slide  right,  with  the  nearer  objects 
moving  most  rapidly  out  of  view  at  the  right.  If  the  camera  is 
instead  looking  directly  along  its  direction  of  motion,  we  would 
experience  a  looming  effect  from  objects  lying  straight  ahead,  and 
see those off to the sides slowly accelerate then fly out of sight
as they approach the edges of the frame. If we had a perfect
segmentation scheme¹, we might expect to be able to outline an
object  in  the  individual  frames  in  which  it  is  seen,  and  collect 
all  of  these  together  into  a  cone  following  the  path  it  took  as  it 
moved  about  the  image  and  then  disappeared  from  view.  This 
outline  would  change  size  and  shape  as  the  object  came  closer 
or  receded,  and  as  it  presented  different  parts  of  its  surface  to 
view.  This  cone,  a  surface  or  two-dimensional  manifold  in  a  three- 
dimensional  space,  would  define  the  evolution  of  the  projection 
of  that  object  over  time.  If  we  could  track  all  such  outlines  for 
all  objects  in  the  scene,  we  would  say  we  had  a  description  of 
the  evolution  of  the  imagery  over  time.  This  is  the  description 
we  want.  The  outlines  encode  spatial  coherence,  and  along  their 
axes the cones maintain the objects' temporal coherence.

Not having perfect segmentation, we will not expect to have
disjoint outlines for all objects in an image. Given the variability
of lighting conditions, sensor noise, reflections and texturing,
etc., we might be more realistic in expecting nothing short of a
hideous swirl of contours. But given a continuous deformation in
the imagery, even this hideous swirl will sweep out a surface as
we watch it in space-time. We have developed, and describe in
this paper, a process which builds a representation of the spatial
and temporal structure of these surfaces.

The companion paper [2] specifies the computational utility of
this description in the context of our spatiotemporal tracking
and surface reconstruction work. Our intention in this paper is
to describe the development and operation of the surface building
algorithm, to indicate applications in which it has proven to be
valuable, and to suggest others in which it may.

¹ This is probably a meaningless premise.

1.3.  Is This Surface Building New?

Constructing surfaces in 3-D data such as this has been done
before, although not in a manner that could be effective for our
needs. Artzy et al. in [1] describe what has become the principal
approach to surface reconstruction from sensed tomographic-style
data; it is, in fact, the only surface constructor of note. In
their  approach,  which  was  developed  in  the  context  of  surface 
building  from  medical  CT  data,  each  surface  is  processed  sepa¬ 
rately.  The  processing  begins  with  the  selection  of  a  seed  voxel 
of the desired density (measured in Hounsfield units). A sequential
recursive search is then used to traverse the implied surface,
forming  a  connected  set  of  all  those  facets  positioned  between  ad¬ 
jacent  voxels  where  one  has  density  less  than  and  the  other  has 
density  greater  than  or  equal  to  the  chosen  value.  Since  surface 
traversal  is  by  a  connected-component  search  through  the  3-D 
volume,  all  of  the  data  must  be  acquired  beforehand.  The  sur¬ 
faces  traversed  in  this  manner  are  guaranteed  to  be  closed.  Once 
one  surface  has  been  traversed,  the  next  may  be  begun.  This  is 
called  the  cuberille  model  of  surface  reconstruction. 
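
The seed-based style of processing can be sketched as below. This is a simplified stand-in for the cuberille traversal, not the algorithm of [1]: it flood-fills the face-connected set of voxels at or above the chosen density and collects every facet on its boundary, instead of walking one connected surface facet by facet. The names and the `volume[z][y][x]` layout are assumptions. It does share the property at issue here, which is that the whole volume must be in memory before the search starts.

```python
from collections import deque

NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def boundary_facets(volume, seed, threshold):
    """Collect facets bounding the face-connected object containing seed.

    volume[z][y][x] holds densities; a voxel is inside when its density
    is >= threshold. Each facet is returned as a pair
    (inside_voxel, outside_voxel) of (x, y, z) positions. Positions
    beyond the array count as outside.
    """
    def inside(voxel):
        x, y, z = voxel
        return (0 <= z < len(volume)
                and 0 <= y < len(volume[0])
                and 0 <= x < len(volume[0][0])
                and volume[z][y][x] >= threshold)

    if not inside(seed):
        raise ValueError("seed voxel is below the chosen density")
    seen = {seed}
    queue = deque([seed])
    facets = []
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in NEIGHBORS:
            n = (x + dx, y + dy, z + dz)
            if inside(n):
                if n not in seen:
                    seen.add(n)
                    queue.append(n)
            else:
                facets.append(((x, y, z), n))
    return facets
```

A single dense voxel yields six facets, and two face-adjacent dense voxels yield ten, since the face they share is interior.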

Although  the  description  we  seek  is  closely  related  to  this,  our 
requirements  demanded  a  very  different  approach: 

1.  all  the  surfaces  in  the  sequence  are  of  interest  to  our  tracker, 
not  just  selected  ones,  and  we  need  to  follow  them  all; 

2. our aim of autonomous navigation through sequential
processing means that we would never expect or desire to have
all the data available in advance: our data (images from a
camera) are obtained as we move, and must be processed as
they are acquired;

3. a recursive traversal of individual surfaces is not appropriate,
nor is processing the surfaces separately in sequence - all
surfaces must be built incrementally and in parallel as each
new image is acquired;

4. integral positioning of facets is insufficiently precise - for
accurate mapping we need sub-pixel resolution in surface
definition;

5. simple density or intensity themselves are inappropriate for
surface definition - we track the evolution of image features
over time, and define our features, and hence our surfaces,
as being located at the zeros of Laplacians of a Gaussian;

6. we will have need of various sorts of computation on the
local surface as it is being constructed, and the organization
of the process must support this.

Each of our design objectives differs in important ways from [1],
but the one most distinguishing of our approach is point 3. Both
sequential processing and a capability for parallel implementation
are crucial for a realistic tracking system.

1.4.  Surface Building Operation

The surface constructor we have developed operates on images
sequentially as they are acquired, knitting together a connected
representation for the spatial and temporal evolution of the
sequence over time. It maintains the continuity of feature paths
irrespective of changes in viewing angle, or a varying rate of
image acquisition: its sole requirement of the data is that
between frames, surfaces move in some direction no more than their
width². The processor acts as a loom during surface construction,
with a wall of accumulators meeting each image in sequence and
weaving its elements into the mesh of surfaces it has prepared
behind it. From this action we give it its name, the Weaving
Wall.

² Otherwise they would be disjoint, and form two volumes, not one.


Since the principal reason for developing the Weaving Wall was
for use in motion sequence analysis, we will develop its operation
in the context of Laplacians. For additional applications which
we will touch on later, other measures may be more appropriate
- for example, we use Hounsfield density when we process CT
data, and track not density zeros, but zeros with respect to a
bias, the chosen surface density value. But the distinction is
only incidental to the development, and we will work for now
with the Laplacians.
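
The sequential character of the process can be illustrated with a short generator. It is only a sketch of the frame-at-a-time idea, assuming frames arrive as 2-D lists of values, and it does not link facets into connected surfaces as the Weaving Wall does. The name `weave`, the tuple layout and the `bias` parameter (zero for Laplacians, the chosen density for CT data) are illustrative.

```python
def weave(frames, bias=0.0):
    """Yield (kind, t, v, u) for each facet as frames arrive.

    frames is any iterable of 2-D lists (rows indexed by v, columns
    by u). A voxel is "on" when value - bias > 0. Only the previous
    frame is retained, so the sequence never has to be held in memory.
    """
    prev = None
    for t, frame in enumerate(frames):
        on = [[value - bias > 0 for value in row] for row in frame]
        for v, row in enumerate(on):
            for u, state in enumerate(row):
                if u > 0 and row[u - 1] != state:
                    yield ("U", t, v, u)   # between (u-1, v) and (u, v)
                if v > 0 and on[v - 1][u] != state:
                    yield ("V", t, v, u)   # between (u, v-1) and (u, v)
                if prev is not None and prev[v][u] != state:
                    yield ("T", t, v, u)   # between frames t-1 and t
        prev = on
```

Because `frames` may itself be a generator fed by a camera, facets are produced as each image is acquired, which is the property that a seed-based traversal of a complete volume cannot offer.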

2.  WEAVING  WALL  DESIGN  AND  IMPLEMENTATION 

2.1.  Local  Surface  Elements  from  the  Volumetric  Data 
One characteristic presented in the derivation of the Artzy et al.
surface reconstruction process, a binary relationship on the
voxels,  holds  for  an  important  element  of  our  processing.  In 
tomographic  reconstruction,  voxels  bear  density  measures,  and 
for  a  chosen  density  a  voxel  is  either  inside  (which  includes  on) 
a  surface  -  its  value  is  at  least  as  great  as  that  chosen  -  or  not 
inside  -  its  value  is  less  than  that  chosen  -  and  there  can  be 
no  path  from  one  to  the  other  that  does  not  cross  the  surface 
frontier.  In  our  case  the  situation  is  identical:  a  voxel  bearing  a 
value  from  a  Laplacian  of  a  Gaussian  is  either  positive  or  non¬ 
positive,  and  a  path  from  a  positive  voxel  to  a  non-positive  one 
must pass through zero. This guarantees that our surfaces will
be closed - in fact they will enclose positive or non-positive voxel
values, depending on our choice. This definition ensures that
each surface is a Jordan curve (Jordan surface)³. In essence, this
means that a surface S divides $R^3$ into 2 volumes: that inside
and that outside of S, with no holes, knots, et cetera, upsetting
the continuity of the surface. This allows a finite case analysis of
the local surface.

In  two-dimensional  images,  zero-crossings  will  form  closed  re¬ 
gions  composed  of  pixels  having,  say,  positive  Laplacian  value. 
In  three  dimensions  the  zero-crossings  will  form  closed  volumes 
composed  of  voxels  having  Laplacian  values  of  the  same  sign. 
In 2-D, we define the region frontier to be a contour composed
of two types of edge element: a U edgel is oriented vertically
and positioned between oppositely-signed horizontally-adjacent
Laplacian pixels, and a V edgel is oriented horizontally and
positioned between oppositely-signed vertically-adjacent Laplacian
pixels (see Fig. 1). Interpolating the edgel between the two pixels
allows it to be positioned up to ±0.5 units horizontally or
vertically from the boundary between them. When working with an
image sequence, we form a three-dimensional dataset by treating
the collection of images as a set of arrays of voxels each one pixel
deep. Here, we define the volume zero-crossing frontier to be a
surface composed of three types of voxel facets: U and V facets
separate voxels horizontally and vertically, as in the 2-D case,
while T facets separate voxels temporally: they occur between
voxels  at  identical  positions  in  adjacent  images.  Again,  interpo¬ 
lation  allows  us  to  position  the  surface  elements  with  sub-voxel 

Fig. 2. Facet Labelling on 3-Space Surface.


Fig.  1.  Region  Edgels  in  Two  Dimensions. 

Having defined what our surface elements are, we must
demonstrate that we can obtain them from the data. We will do this
with reference to both the 2-D and the 3-D cases, and will use
both monocular and crossed-eye stereo displays to indicate the
geometry behind the processing.
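
The sub-pixel positioning of an edgel or facet between two oppositely-signed samples can be sketched with linear interpolation. The choice of linear interpolation and the function name are assumptions made for illustration; the text states only that interpolation places the element within ±0.5 units of the boundary between the two samples.

```python
def edgel_offset(left, right):
    """Offset of the zero crossing from the boundary between two samples.

    left and right are Laplacian values at adjacent pixels one unit
    apart; exactly one of them must be positive. The result lies in
    [-0.5, 0.5], with negative values toward the left pixel.
    """
    if (left > 0) == (right > 0):
        raise ValueError("samples have the same sign: no edgel here")
    # The zero lies at fraction f of the way from the left pixel
    # centre to the right pixel centre; the boundary is at f = 0.5.
    f = left / (left - right)
    return f - 0.5
```

With values of 3 and -1 the crossing sits a quarter of a unit toward the right-hand pixel, whose magnitude is the smaller of the two.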

Given the local nature of the above definition, where facets are
positioned between adjacent voxels based solely on their
Laplacian values, it would seem that detecting a facet could be
accomplished with very simple tests applied to only local parts
of the dataset. In fact, everything necessary for determining the
local surface structure is available in each 2 x 2 x 2 window of
the data. Consider a 2 x 2 x 2 set of voxels containing Laplacian
values. There are 2^8 different combinations of positive and
negative (more precisely, positive and non-positive) Laplacian values
in these 8 voxels. If some of these voxels are of different sign,
then we have a surface (or several surfaces) passing through this
2 x 2 x 2 part of the data set⁴. Presume that we have processed
at least one image, and are now processing the next in the
sequence. Our scanning is bottom to top, and within that, left to
right (the development for a parallel implementation differs in
detail, and has not been completed). We label our facets as
indicated in the stereo display of Fig. 2, where the U facets are from
the set (U U-1 U-2 U-3), the V facets are from (V V-1 V-2 V-3),
and the T facets are from (T T-1 T-2 T-3). Ignoring boundary
conditions that occur at the edges of the frame, we can see that
a simple 8-bit index based on the signs of these voxels can
inform us of the local surface structure. Addressing voxels by their
{uvt | u, v, t ∈ (0, 1)} relative indices, we can number them from
0 through 7, with 0 being the near lower left (0, 0, 0), 1 being
the far lower left (0, 0, 1), 7 being the far upper right (1, 1, 1), et
cetera. If voxel 7 is positive (termed on), and the rest are
non-positive (termed off), our 8-bit index is 200₈, and we will have a
single surface having facets at U, V and T (Fig. 3a). The
position of these facets will depend on the values of voxels 3, 5, and 6
with respect to voxel 7. The figure to the right uses circles to
encode voxel state (borrowed from [6]), with filled meaning on and
empty  meaning  off.  If  voxel  0  is  also  on.  our  index  is  201s  and  we 
will  have  two  surfaces,  one  as  we  had  with  7  alone  on.  as  above, 
anti  the  other  having  facets  at  T-3  I’-fand  7-.?(Fig.  3b).  If  only 
voxels  3.  5  and  7  are  on.  the  signature  is  250s  and  we  will  have 
a  single  surface  witli  facets  at  T-l  T  T-3  Y-l  and  T-l  (Fig.  3c). 
Sliding  the  2  x  2  x  2  window  one  pixel  in  each  direction  will  give 
us  the  adjacent  local  surface  structure  in  that  direction.  Do  this 
throughout  the  entire  volume,  and  we  will  have  a  description  of 
ail  the  surfaces  in  I  lie  dataset. 
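
The indexing just described can be sketched in a few lines. This is only an illustration of the numbering and of the octal signatures quoted in the text; the function names are ours, not the authors', and the bit ordering (voxel n sets bit n, with t as the lowest-order offset) is inferred from the examples given.

```python
def voxel_number(u, v, t):
    """Number a voxel of the 2 x 2 x 2 window from its (u, v, t) offsets (each 0 or 1)."""
    return 4 * u + 2 * v + t


def signature(values):
    """Build the 8-bit index: bit n is set when voxel n is positive (on)."""
    sig = 0
    for n, value in enumerate(values):
        if value > 0:
            sig |= 1 << n
    return sig


def window_with_on(on_voxels):
    """Hypothetical helper: a window whose listed voxels are positive, the rest negative."""
    return [1.0 if n in on_voxels else -1.0 for n in range(8)]


print(oct(signature(window_with_on({7}))))        # 0o200
print(oct(signature(window_with_on({0, 7}))))     # 0o201
print(oct(signature(window_with_on({3, 5, 7}))))  # 0o250
```

The three printed values are the signatures of Figs. 3a, 3b and 3c respectively.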


4 This approach is an extension of a two-dimensional contour finder, having
_ cases, developed by Marimont.


Fig. 3a. Local Surface, n = 200_8.


Fig. 3b. Local Surface, n = 201_8.


Fig. 3c. Local Surface, n = 250_8.

2.3. Surface Connectivity and Facet Adjacency
Notice in the above that we have defined neighboring voxels as
being part of a particular volume if they have a face in common;
edges and corners are not sufficient. This means that the
complement space, made up not of enclosed voxels but excluded
voxels, has a connectivity definition for its voxels that includes
edges and corners. This may be best visualized in two
dimensions, as demonstrated in Fig. 4. There, with pixels numbered 0
through 3, two cases are presented. First, with pixels 1 and 3 on,
we have a single contour between pixels (0, 2) and pixels (1, 3),
and it encompasses a region above that frontier (Fig. 4a). The
complement region shares this contour, but encompasses the
pixels below it. If, on the other hand, pixels 0 and 3 are on, we will
have two contours passing through the 2 x 2 window, one with
edgels between pixel 0 and pixels 1 and 2, the other with edgels
between pixel 3 and pixels 1 and 2 (Fig. 4b). The first region
encompasses pixel 0 and the other encompasses pixel 3. The two
pixels are not judged locally to be part of the same region since
they do not share a face. The complement region, however,
encompasses pixels 1 and 2, in fact squeezing itself between the two
other regions. If a reverse on sense were used (negative rather
than positive), we would not have an identical segmentation.
Rather, pixels 1 and 2 would be parts of different regions, and
pixels 0 and 3 would be part of the complement region (Fig. 4c).
This means that the choice of enclosure sign affects more than
just the sense of the regions; it also affects their structure.

Fig. 4a. Pixels 1 and 3 on. Fig. 4b. Pixels 0 and 3 on. Fig. 4c. Pixels 1 and 2 on.
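
The asymmetry between the two connectivity rules can be checked with a small sketch of the 2 x 2 window. It assumes pixel n sits at column n >> 1 and row n & 1, a layout inferred from the text (0 and 3 are diagonal, as are 1 and 2); the function names are illustrative only.

```python
def position(n):
    """Assumed layout: pixel n is at (column, row) = (n >> 1, n & 1)."""
    return (n >> 1, n & 1)


def adjacent(a, b, connect_diagonals):
    """Face adjacency always counts; corner adjacency only for the complement."""
    (ca, ra), (cb, rb) = position(a), position(b)
    dc, dr = abs(ca - cb), abs(ra - rb)
    if dc + dr == 1:
        return True
    return connect_diagonals and dc == 1 and dr == 1


def count_regions(pixels, connect_diagonals):
    """Count connected groups among the given pixels of the window."""
    remaining = set(pixels)
    regions = 0
    while remaining:
        regions += 1
        stack = [remaining.pop()]
        while stack:
            current = stack.pop()
            linked = {p for p in remaining if adjacent(current, p, connect_diagonals)}
            remaining -= linked
            stack.extend(linked)
    return regions


def segment(on_pixels):
    """Return (number of on regions, number of complement regions)."""
    off_pixels = set(range(4)) - set(on_pixels)
    return (count_regions(on_pixels, connect_diagonals=False),
            count_regions(off_pixels, connect_diagonals=True))


print(segment({1, 3}))  # (1, 1): one region above the contour, one below
print(segment({0, 3}))  # (2, 1): two on regions, one complement squeezed between
print(segment({1, 2}))  # (2, 1): the reversed-sense reading of the same pattern
```

Pixels 0 and 3 form two regions when they are on but a single region when they are the complement, which is the structural difference noted above.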

Being consistent in this definition of adjacency is fairly crucial to
maintaining a coherent surface description in 3-space. Lorensen
et al. in [6] describe an independently developed algorithm for
surface rendering based on the same use of voxel sign signatures.
In their attempts to reduce the complexity of the coding, they
fold the mapping about many of its symmetries. In general, any
folding that is not undone in a later mapping would destroy the
spatial coherence of our weaving. They, however, do not seek a
coherent surface description, settling instead for a set of
triangular facets appropriate for rendering. One of their foldings is
to map the signature to its minimum with respect to number of
voxels on. If five voxels are on with a signature of 37_8, this will be
recoded as 340_8, having the previous five voxels now off and only
three voxels on. The effect is to treat a pattern of on/off voxels
as possibly a pattern of off/on voxels, complementing their true
sense. The 2-D example of Fig. 4c above shows that this
symmetry mapping is, in fact, not valid - it destroys the coherence of the
description. In 3-D this is most disruptive at saddle points. The
displays in [6] are dense enough that this isn't noticed visually.
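
The complement folding described above is simple to state in code. The sketch below follows only the description in this paragraph; it is not taken from [6], and the handling of signatures with exactly four voxels on (left unchanged here) is our assumption.

```python
def complement(signature):
    """Swap the on/off sense of all eight voxels."""
    return ~signature & 0xFF


def fold_to_minimum(signature):
    """Recode a signature so that at most four voxels are on."""
    on_count = bin(signature).count("1")
    if on_count > 4:
        return complement(signature)
    return signature


print(oct(fold_to_minimum(0o37)))   # 0o340: five voxels on become three
print(oct(fold_to_minimum(0o200)))  # 0o200: already minimal, unchanged
```

After folding, a signature no longer records which side of the surface was on, and it is exactly this information that the connectivity definition above depends on.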

Now, individual facets are related to those on the surface around
them in the same way that individual edgels in 2-D are related to
the edgels adjacent to them on a contour. 2-D edgels have
two-connectivity - they have adjacent edgels to their left and right
(with respect to some sense); 3-D facets have four-connectivity -
they have adjacent facets up, down, and to their left and right
(again with respect to some sense). Adjacency of facets requires
a shared edge. For example, when voxel 7 alone is on (having
a signature of 200_8), the 3 faces involved (U, V and T) are set
to indicate their neighbors: U and V are to each other's left and
right, respectively, U and T are to each other's right and down,
respectively, and V and T are to each other's down and down,
respectively. The sense is with respect to looking in at the on
voxel from outside the face. In general, facet connectivity is of
order exactly 4. Each of the 4 edges of a facet will be shared by
just one other facet, and only when a voxel is on the boundary
of the dataset will there fail to be facets to share all of its edges.

The coding of a local topographic signature also enables very
efficient implementation of surface property computations that
would otherwise require a total traversal of all the surfaces once
completed. First among these is the maintaining of distinction
between surfaces that are disjoint. Each surface is given an
identifying index when first encountered, and this index is given
to each face that is later found to be part of that surface. Only
certain patterns of on/off voxels can arise from the uncovering of
a new surface: ignoring boundary conditions5, the local
configuration has to be such that it includes an on convex corner at
voxel 7. Not all such corners are actually new surfaces, however,
as the local information is insufficient to indicate that the surface
to which these facets belong has already been encountered. This
can only be detected when the two surfaces (the earlier and that
containing the new convex corner) come together in the 2 x 2 x 2
window. Facet linking provides for the combining of surfaces when
their frontiers merge. With this approach, disjoint surface
computation is primed by precompilation into the case dispatcher and
finalized on the fly by the facet linker: at each stage of the
processing the most concise surface description possible is the one
maintained.
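
The text does not say how surface indices are combined when two frontiers merge. One conventional way to keep such labels consistent is a union-find (disjoint-set) structure; the sketch below uses that technique as a stand-in, with class and method names of our own choosing.

```python
class SurfaceLabels:
    """Disjoint-set bookkeeping for surface indices (a stand-in technique)."""

    def __init__(self):
        self.parent = []

    def new_surface(self):
        """Issue a fresh index, as at a new on convex corner at voxel 7."""
        self.parent.append(len(self.parent))
        return len(self.parent) - 1

    def find(self, label):
        """Return the representative index of the surface containing label."""
        while self.parent[label] != label:
            self.parent[label] = self.parent[self.parent[label]]  # path halving
            label = self.parent[label]
        return label

    def merge(self, a, b):
        """Record that two indices belong to one surface; keep the earlier one."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)
        return min(ra, rb)

    def count(self):
        """Number of distinct surfaces currently known."""
        return sum(1 for index, parent in enumerate(self.parent) if index == parent)


labels = SurfaceLabels()
first = labels.new_surface()
second = labels.new_surface()   # looked new locally ...
third = labels.new_surface()
labels.merge(first, second)     # ... but its frontier later met the first surface
print(labels.count())           # 2
```

Keeping the earlier index on a merge mirrors the idea that the later corner was not really a new surface.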

Another measure that is similarly encoded in the voxel signature
is the determination of surface bounding cartesian volume. As
was the case with the characterization of new surface arrival, the
voxel configurations that contribute (or may contribute) to an
adjustment on the enclosing cartesian frame for a surface can be
characterized by their local topography. A convex corner with
voxel 7 on, as above, may affect the lower limits in all of u, v and
t for the surface. Similarly, voxels 2 and 3 on can mean an altered
upper u limit and an altered lower v limit for their surface (see
Fig. 2), but cannot affect the t limits (again, ignoring boundary
conditions). These tests are compiled into the case dispatcher as
appropriate.
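
As a rough illustration of how such tests can be read off a signature, the sketch below checks each pair of voxels that differ along one axis. It assumes the voxel numbering 4u + 2v + t implied earlier and is our reading of the two examples in the text, not the authors' dispatcher.

```python
AXES = (("u", 4), ("v", 2), ("t", 1))  # axis name and its bit in the voxel number


def affected_limits(signature):
    """Return the set of (bound, axis) limits that this signature may adjust."""
    limits = set()
    for axis, bit in AXES:
        for low in range(8):
            if low & bit:
                continue  # visit each pair once, from its lower voxel
            high = low | bit
            low_on = bool(signature >> low & 1)
            high_on = bool(signature >> high & 1)
            if high_on and not low_on:
                limits.add(("lower", axis))   # surface faces toward smaller values
            elif low_on and not high_on:
                limits.add(("upper", axis))   # surface faces toward larger values
    return limits


print(sorted(affected_limits(0o200)))  # voxel 7 on: lower limits in t, u and v
print(sorted(affected_limits(0o14)))   # voxels 2 and 3 on: lower v, upper u
```

Both results agree with the cases quoted above, including the absence of any t limit for voxels 2 and 3.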

This attribute coding and incorporation into the weaving
process does two things: it eliminates a later (and very expensive)
sequential search; and it minimizes the time spent overall by
restricting the calculations to only those parts of the surface where
they are necessary. Fig. 5 sketches the local voxel configuration
for a case having signature value of 267_8 (a saddle point).

The act of creating a new facet also maintains gradient mean
and variance statistics for each surface. These can be of use in
filtering surfaces by their significance - a low mean indicates a
subtle surface more likely to be spurious.
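
The text does not specify how the running statistics are kept. A common single-pass choice is Welford's update, sketched here as an assumption; the class name and the use of gradient magnitude as the accumulated quantity are ours.

```python
class GradientStats:
    """Running mean and population variance for one surface (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, magnitude):
        """Fold in the gradient magnitude of a newly created facet."""
        self.count += 1
        delta = magnitude - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (magnitude - self.mean)

    @property
    def variance(self):
        return self._m2 / self.count if self.count else 0.0


stats = GradientStats()
for magnitude in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(magnitude)
print(stats.mean, stats.variance)
```

Because each facet is folded in as it is created, no later pass over the surface is needed to obtain the statistics.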

2.4. Economies in the Construction

An advantage of the sequential processing - top to bottom and
left to right - forced on us by using a sequential machine, is that
we know when processing that certain facets in the 2 x 2 x 2
window have already been created and linked together. In fact,
at any given (u,v,t), we are only creating facets of type U, V,
and T, and linking facets that are adjacent across the three edges
of that octant. In Fig. 5 only two new facets and four links were
needed in constructing the fairly complex local surface for that case.
All other facets and links were created earlier (refer to Fig. 2).

Also important to note is that the state of voxel 0 affects neither
the face and surface creation nor the face linking operations.
There are no U, V or T facets in that octant, nor links to any of
these facets. In effect removing one bit from our signatures, this
lets us halve the state space. Exploiting this coding efficiency,
the surface construction code for case 267_8 of Fig. 5 is also
appropriate for the case of signature 266_8, as shown in Fig. 6. Notice
that this is not a saddle point, and locally involves not one but
two surfaces.

Fig. 6. Voxel Configuration, 266_8.

Fig. 7 shows a local surface configuration differing from that of
Fig. 5 by, again, a single sign bit - voxel 5 here is off - and
this gives it a signature of 227_8. This is identical topologically
to the local surface of Fig. 6, but is oriented differently. This
slight change leads to a rather different set of instructions for the
weaver, including the creation of a new surface.
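
The halving of the state space described above amounts to discarding the bit for voxel 0 before dispatch. A minimal sketch, with a hypothetical function name:

```python
def dispatch_key(signature):
    """Drop the voxel 0 bit, leaving a 7-bit key with 128 possible values."""
    return signature >> 1


print(dispatch_key(0o267) == dispatch_key(0o266))  # True: same construction code
print(dispatch_key(0o267) == dispatch_key(0o227))  # False: voxel 5 does matter
```

The first comparison is the pair of Figs. 5 and 6; the second is the pair of Figs. 5 and 7, where the changed bit is not voxel 0 and a different case is dispatched.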


Fig. 7. Voxel Configuration, 227_8.

2.5.  Representational  Duality 

A surface representation composed of the facets separating on
and off voxels, as indicated in Fig. 2, is one way of viewing the
results of the Weaving Wall. This is the view in the cuberille
model, although without the interpolation (which in effect
allows our facets to be of varying size). Each planar facet has
integral coordinates along 2 axes, an interpolated position along
the third, and an estimate of surface normal computed by the
3-D gradient convolutions. An enriched view of the surfaces is
obtained if one chooses to consider them in a dual sense, the
principal elements of which are the nonplanar patches formed by
arcs joining adjacent facets. The shape of the patches can be
estimated by interpolation of their vertex (facet) normals. Since the
boundaries are oriented along the three principal axes, this gives
a cartesian ruled surface representation, as indicated in Fig. 8,
showing, in crossed-eye stereo form, a dart-shaped cone of
circular cross-sections. Patches can have from 3 to 7 linear arcs on
their boundaries: 3 is a triangle (Fig. 9 right); 4, 5 and most 6's
are simple non-planar shapes (Fig. 9 right); some 6's and all 7's
are saddle points (Fig. 9 left).

2.6. Locating Facets

We have discussed the structure of our surfaces and their
construction from local surface elements, but have not described
how we locate and parameterize those individual elements - the
facets. Our data is three-dimensional, two being spatial with
the third being temporal, and, in determining the facet positions
and orientations, we convolve this data with a battery of
three-dimensional filters15. Facet orientation is determined by
computing the three partial directional derivatives about each voxel.
We compute these derivatives by forming a partial derivative of
a chosen 3-D Gaussian and convolving it in the appropriate
direction at each voxel. Computing facet position is accomplished
in a similar manner: we compute the sum of the second partial
directional derivatives of the Gaussian at each voxel, and position
the facets by linear interpolation at zeroes of these values.
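
The linear interpolation step can be written out directly. The sketch below places a facet between two neighboring voxels whose values differ in on/off state; the function name and argument order are illustrative.

```python
def zero_crossing(pos_a, value_a, pos_b, value_b):
    """Linearly interpolate the position at which the value passes through zero."""
    if (value_a > 0) == (value_b > 0):
        raise ValueError("voxels have the same on/off state; no facet between them")
    fraction = value_a / (value_a - value_b)
    return pos_a + fraction * (pos_b - pos_a)


print(zero_crossing(10, 3.0, 11, -1.0))  # 10.75
```

With values 3 and -1 at integer positions 10 and 11, the zero lies three quarters of the way from the first voxel to the second, which is why facets have integral coordinates on two axes and an interpolated one on the third.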

The Gaussian is a separable filter, so we can obtain the three
directional derivative components as:

$$\partial_u G(I) = G'_u \otimes G_v \otimes G_t \otimes I$$

$$\partial_v G(I) = G_u \otimes G'_v \otimes G_t \otimes I$$

$$\partial_t G(I) = G_u \otimes G_v \otimes G'_t \otimes I$$

where $\otimes$ denotes convolution, $G_u$ is a 1 x k vector of Gaussian
weights, $G_v$ is a k x 1 vector of Gaussian weights, and $G_t$ is a
1 x k vector of Gaussian weights applied in the third
dimension over a set of k images, centered at image t. The $G'$s
are derivatives of the appropriate $G$s.

The Laplacian is also separable (Huertas [5] and Marimont (who
developed this independently, and did it for higher dimensions)),
so it can be computed as the sum of the second partial directional
derivatives:

$$L(G(I)) = \partial_u^2 G(I) + \partial_v^2 G(I) + \partial_t^2 G(I)$$

$$= G''_u \otimes G_v \otimes G_t \otimes I + G_u \otimes G''_v \otimes G_t \otimes I + G_u \otimes G_v \otimes G''_t \otimes I$$

where the $G''$s are second derivatives of the appropriate $G$s.
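
The separable forms above can be sketched in plain Python for a small volume stored as nested lists indexed [t][v][u]. Everything here is illustrative: the sampled kernels, the zero padding at the borders, and the function names are our assumptions, and no attempt is made to share intermediate results.

```python
import math


def gaussian_kernels(sigma, radius):
    """Sampled 1-D Gaussian (normalized) and its first and second derivatives."""
    xs = range(-radius, radius + 1)
    g = [math.exp(-x * x / (2 * sigma ** 2)) for x in xs]
    total = sum(g)
    g = [w / total for w in g]
    g1 = [-x / sigma ** 2 * w for x, w in zip(xs, g)]
    g2 = [(x * x / sigma ** 4 - 1 / sigma ** 2) * w for x, w in zip(xs, g)]
    return g, g1, g2


def convolve_axis(vol, kernel, axis):
    """Convolve vol[t][v][u] with a 1-D kernel along 't', 'v' or 'u' (zero padded)."""
    nt, nv, nu = len(vol), len(vol[0]), len(vol[0][0])
    radius = len(kernel) // 2
    st, sv, su = {"t": (1, 0, 0), "v": (0, 1, 0), "u": (0, 0, 1)}[axis]
    out = [[[0.0] * nu for _ in range(nv)] for _ in range(nt)]
    for t in range(nt):
        for v in range(nv):
            for u in range(nu):
                acc = 0.0
                for k, w in enumerate(kernel):
                    d = radius - k  # convolution flips the kernel
                    tt, vv, uu = t + d * st, v + d * sv, u + d * su
                    if 0 <= tt < nt and 0 <= vv < nv and 0 <= uu < nu:
                        acc += w * vol[tt][vv][uu]
                out[t][v][u] = acc
    return out


def separable(vol, kernels):
    """Apply one 1-D kernel per axis; kernels maps 'u', 'v', 't' to a kernel."""
    for axis in "uvt":
        vol = convolve_axis(vol, kernels[axis], axis)
    return vol


def gradient(vol, sigma=1.0, radius=3):
    """The three directional derivative volumes, keyed by axis name."""
    g, g1, _ = gaussian_kernels(sigma, radius)
    return {a: separable(vol, {b: (g1 if b == a else g) for b in "uvt"}) for a in "uvt"}


def laplacian(vol, sigma=1.0, radius=3):
    """Sum of the three second directional derivative volumes."""
    g, _, g2 = gaussian_kernels(sigma, radius)
    terms = [separable(vol, {b: (g2 if b == a else g) for b in "uvt"}) for a in "uvt"]
    return [[[sum(term[t][v][u] for term in terms) for u in range(len(vol[0][0]))]
             for v in range(len(vol[0]))] for t in range(len(vol))]


# A 7 x 7 x 7 ramp in u: its u-derivative should be close to 1 at the centre.
ramp = [[[float(u) for u in range(7)] for _ in range(7)] for _ in range(7)]
print(gradient(ramp)["u"][3][3][3])
```

Each three-dimensional filter is applied as three one-dimensional passes, so the cost per voxel grows with 3k rather than with the cube of k.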

We can use separability to recompose these convolutions
economically as follows:

$$
\begin{aligned}
G_u(I) &= G_u \otimes I && (1) \\
G_v(I) &= G_v \otimes I && (2) \\
G_{uv}(I) &= G_u \otimes G_v(I) = G_v \otimes G_u(I) && (3) \\
G_{ut}(I) &= G_u \otimes G_t(I) = G_t \otimes G_u(I) && (4) \\
G_{vt}(I) &= G_v \otimes G_t(I) = G_t \otimes G_v(I) && (5) \\
\underline{\partial_u G(I)} &= G'_u \otimes G_v \otimes G_t \otimes I = G'_u \otimes G_{vt}(I) && (6) \\
\underline{\partial_v G(I)} &= G_u \otimes G'_v \otimes G_t \otimes I = G'_v \otimes G_{ut}(I) && (7) \\
\underline{\partial_t G(I)} &= G_u \otimes G_v \otimes G'_t \otimes I = G'_t \otimes G_{uv}(I) && (8) \\
\partial_u^2 G(I) &= G''_u \otimes G_v \otimes G_t \otimes I = G''_u \otimes G_{vt}(I) && (9) \\
\partial_v^2 G(I) &= G_u \otimes G''_v \otimes G_t \otimes I = G''_v \otimes G_{ut}(I) && (10) \\
\partial_t^2 G(I) &= G_u \otimes G_v \otimes G''_t \otimes I = G''_t \otimes G_{uv}(I) && (11) \\
\underline{L(G(I))} &= \partial_u^2 G(I) + \partial_v^2 G(I) + \partial_t^2 G(I)
\end{aligned}
$$

giving us eleven one-dimensional convolutions (as numbered) for
the full set of required gradient and Laplacian three-dimensional
convolutions (those underlined).

In our experimental development we precompute these derivative
images for a sequence by loading in all images at once and doing
a massive, optimized set of convolutions. Another version of
the processor, meant for true sequential operation, maintains a
pipeline of k images, centered on the current frame t, carrying
out the appropriate U and V convolutions on the newest image
in the pipeline, and appropriate T convolutions on the current
centered image t. The Weaving Wall, at frame t, requires the
gradients for frame t and the Laplacians for both frame t and
frame t - 1.

2.7. Propagating Sequence Constraints

Returning for a moment to our sequence analysis work, we see
that an indexing strategy similar to the above can be used to
turn a complex operation on the spatiotemporal surface into a
fairly simple operation when carried out in parallel locally at the
facet level. The task is the tracking of features between frames.
It is carried out locally by intersecting a pencil of planes with
the evolving spatiotemporal surfaces and propagating tracker
information along these paths (see [2]). Fig. 9 left shows a set
of tracker paths on the local surface for the signature case 267_8
shown above. Tracker information is passed from the trailing
to the leading edge of the patch along the lines drawn, each,
in effect, tracking a feature through time on the spatiotemporal
surface. Here, no folding of the signature is possible, and all 256
cases must be handled explicitly. Fig. 9 right shows the situation
for the signature case 266_8, dealing with two surface patches.

Fig. 9. Epipolar Intersections, 267_8. Epipolar Intersections, 266_8.

2.8. Rendering Surfaces

Rendering surfaces for display is generally a simpler procedure
than the weaving of coherent surface descriptions, and can be
seen as an adaptation of the surface patch dual representation
mentioned above. For rendering, we must tessellate the patches
into triangles. Given our signature mapping technique as above,
this is quite straightforward, even for saddle points. Fig. 10 left
shows the triangulation produced for the case of a signature of
267_8. The order of node traversal about the patches uses a
right-handed rule with the thumb pointing out of the surface (refer to
Fig. 2 and Fig. 5). Fig. 10 right shows the triangulation produced
for a surface patch having signature of 266_8, where two surface
patches are involved.

Fig. 10. Rendering Patches, 267_8. Rendering Patches, 266_8.

Actual rendering is done later, as desired, using the computed
facet coordinates for spatial position, and the facet gradients as
vertex normals. Triangle shading is based on bilinear
interpolation of these gradients. Since disjoint surfaces are represented
as distinct objects, color coding in the display and independent
manipulation of the surfaces is readily attainable.
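
The per-signature triangulations themselves are not listed in the text. As a stand-in, the sketch below splits a patch, given as an ordered loop of 3 to 7 vertices, into a triangle fan that preserves the traversal order (and hence the right-handed orientation). A fan is our simplification and may well differ from the triangulation actually chosen, particularly at saddle points.

```python
def fan_triangulate(patch):
    """Split an ordered vertex loop of 3 to 7 vertices into a triangle fan."""
    if not 3 <= len(patch) <= 7:
        raise ValueError("patches have from 3 to 7 boundary arcs")
    first = patch[0]
    return [(first, patch[i], patch[i + 1]) for i in range(1, len(patch) - 1)]


# A hypothetical five-arc patch; vertices are facet positions (u, v, t).
patch = [(0.0, 0.5, 0.0), (0.5, 0.0, 0.0), (1.0, 0.5, 0.0),
         (1.0, 1.0, 0.5), (0.5, 1.0, 0.0)]
for triangle in fan_triangulate(patch):
    print(triangle)
```

A patch with n boundary arcs yields n - 2 triangles, each listing its vertices in the same rotational sense as the original loop.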
2.9. Considerations for Parallel Implementation
Alterations in the assumptions underlying these processes would
have to be made to enable them to generate code for parallel
implementations, but these would be minor rather than radical
changes. The advantages of parallel implementation are quite
apparent: we could divide our processing time by a large fraction
of the number of pixels in an image (upwards of 65,000 for a
set of 256 x 256 images). Currently the weaver operates on a
Symbolics 3600 at about a 1 kHz voxel rate, suggesting a parallel
implementation might operate at hundreds, perhaps thousands
of frames per second.
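
The arithmetic behind this estimate is short enough to write out. The sketch assumes the idealized case of one processor per pixel, each running at the quoted serial voxel rate, with no communication cost; real figures would be lower.

```python
def serial_seconds_per_frame(width, height, voxel_rate_hz):
    """Time for one processor to visit every voxel of one frame."""
    return width * height / voxel_rate_hz


def ideal_parallel_frames_per_second(voxel_rate_hz):
    """With one processor per pixel, each handles one voxel per frame."""
    return voxel_rate_hz


print(serial_seconds_per_frame(256, 256, 1000))  # 65.536
print(ideal_parallel_frames_per_second(1000))    # 1000
```

A frame that takes about 65 seconds serially would, under these assumptions, take about a millisecond, which is the source of the "hundreds, perhaps thousands" figure.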

3.  WEAVING  WALL  APPLICATIONS 

Some exciting secondary applications of this research arose
directly from the development of the Weaving Wall spatiotemporal
surface-building process. This process was designed to satisfy
our needs for spatially-coherent incremental tracking over an
image sequence. These characteristics make it particularly useful
for other applications where coherent descriptions of nearly
continuous three-dimensional data are sought. Most obvious among
these is the construction of surface models from CT (computed
tomography) and other tomographic (e.g., magnetic resonance,
ultrasound) medical data. We have also been applying it to
visualization problems in studying the behaviour both of images with
respect to spatial resolution and analytic functions of dimension
higher than three.

3.1.  Medical  Tomographic  Data 

Although we developed it for spatiotemporal analysis, we have
applied the algorithm to surface reconstruction from CT slice
data - data which has no temporal component, but is totally
spatial. In this we use tissue density rather than Laplacian
values, maintaining the 3-D gradients for surface normal
calculations. Fig. 11 shows the evolution of surfaces judged to be 'bone'
(greater than 240 units) in a 70 x 30 window of a 52 image CT
dataset. You can see the incremental nature of the surface
development.

Fig. 12 shows another CT case study, this showing front and side
stereo views of the skin, while Fig. 13 shows several stereo views
(side, front, and top) of the denser bone tissue. The patient has
been undergoing cranial reconstruction. These are from 46 slices
of roughly 120 x 120 CT images. Interslice spacing is five times
the slice resolution, which leads to a rather chunky appearance
vertically. The slice sampling was increased for a section of the
dataset in the area about the ear, and it would appear from the
horizontal band there that the patient moved slightly as the scan
adjustments were being made. There was motion also in the area
of the jaw, and X-ray reflections from metal teeth fillings.

Although the surface display aspects of this technique are
evidently quite worthwhile, bear in mind that the primary
representation is a surface model, with all the connectivity
appropriate for full model-based computation (e.g., finite element analysis,
symmetry mappings, elastic deformation operations, et cetera).

Fig. 11. Slice Evolution of Spine

3.2. Images in Scale-Space

One of the standing issues in the motion sequence analysis work,
and in computer vision in general, is the selection of a
Gaussian to be the basis for detection operations: in effect, selecting
the scale of analysis. We have done some limited
experimentation with surface building where the third dimension is Gaussian
scale (σ). Building a surface of the evolution of an image as its
resolution varies provides a vivid picture of how Witkin's 1-D
scale-space studies [8] can extend to images. Fig. 16 shows a
surface constructed at the 3-D Laplacian zeros obtained by
applying a battery of increasingly larger Gaussians to the image
of Fig. 14. This shows eight images, each differing in σ by 0.5
from the one before. Fig. 15 left shows the zero-crossings of the
smallest Gaussian, and its right shows the zero-crossings of the
largest Gaussian - these are the extremes of which Fig. 16
represents the continuum. The most stable representation of a
feature in this space may be at that part of its evolution exhibiting
minimum spatial velocity with respect to Δσ. The connected
scale-space surface makes this stability explicit.
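
The stability criterion suggested above can be illustrated for a single feature tracked through scale. The sketch assumes the feature's position is known at each sampled σ and reduces it to one spatial coordinate; the central-difference estimate, the function name, and the sample data are all hypothetical.

```python
def most_stable_scale(sigmas, positions):
    """Return the sigma at which the feature moves least per unit change in sigma."""
    best_sigma, best_speed = None, float("inf")
    for i in range(1, len(sigmas) - 1):
        # Central difference estimate of |d position / d sigma| at sample i.
        speed = abs(positions[i + 1] - positions[i - 1]) / (sigmas[i + 1] - sigmas[i - 1])
        if speed < best_speed:
            best_sigma, best_speed = sigmas[i], speed
    return best_sigma


# Eight scales in steps of 0.5 (the starting value is made up for illustration).
sigmas = [0.5 + 0.5 * i for i in range(8)]
positions = [10.0, 10.6, 10.9, 11.0, 11.05, 11.3, 11.9, 12.8]
print(most_stable_scale(sigmas, positions))  # 2.0
```

On the connected scale-space surface this corresponds to the stretch of a feature's path that runs most nearly parallel to the σ axis.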

In the near term we will be modifying the Weaving Wall to
produce four-dimensional surfaces, where the first three are the
spatial and temporal as before, and the fourth is Gaussian scale.
Our intention is to use the most stable representation of a
feature as its instantiation to be tracked. The linear estimators will
then use these more appropriate σ values in determining
observation weights and in estimating the resulting spatial precisions.
Tracking will be occurring at all scales at once. In other
applications, such as the processing of magnetic resonance images,
we will be attempting to use the stability of a contour in this
scale dimension as a measure to combine with gradient strength
in estimating both contour scale and significance.

Another motivation for four-dimensional surface reconstruction,
still lying ahead in our work, is its applicability to describing
the evolution of three-dimensional (rather than two-dimensional)
events over time. Rather than having a fourth dimension being
scale, we could make it time, as we have in our tracking work.
Our first motivating force in this is a collaborative effort we are
beginning in representing the time-dependent geometry of
materials as they fracture. An equally exciting application lies in
capturing the temporal relationships in 3-D tomographic data -
building dynamic models of, for example, beating hearts or
rotating vertebrae.

3.3. Analytic Functions

The Weaving Wall has proven to be a useful tool in object
representation studies. Our research group studies representation
issues for object modelling, and has developed a modelling
facility based on Superquadrics [7]. Abstracting from this
representation, Hanson [4] has developed a higher-dimensioned
hyperquadric formulation, a three-dimensional projection of which
gives a superset of the superquadric primitives. Experimenting
with these higher dimensioned objects is complicated by the
impossibility of viewing in these higher dimensions. To facilitate
this, we display sequences of three-dimensional projections of
these n-dimensional objects (n > 3), and use the Weaving Wall
in this rendering. Fig. 17 shows a sequence of such projections
through a four-dimensional surface. Viewing such frames as a
sequence gives us insight into a surface's four-dimensional
structure.

Other applications of the surface-building process described here
include representation of surfaces from other two-dimensional
sensing domains (e.g., ultra-sound, geology), display of surface
fracture reconstructions (2-D over time), and the colorization of
black and white film. We should have preliminary results from
the first two areas shortly. In general, this can be used in any
application where a depiction or computational representation is
desired of the evolution of a two-dimensional pattern varying in
a continuous manner over a third dimension, be that dimension
time, space, viewing position, resolution, or whatever.


Fig.  17.  Four  3-D  Projections  through  4-D  Surface. 


4. CONCLUSIONS

The Weaving Wall was a necessary development for our
continuing research in sequence analysis. The structures it produces
provide for our current needs in tracking features on the
spatiotemporal surface, and seem also to be most appropriate for camera
solving and for our intended extension of the analysis to
non-linear camera paths. The form of its implementation has enabled
us to implement many complex geometric three-dimensional
image operations at a very local level, and to maintain important
surface properties at minimal cost. The applications to medical
imaging, display of higher-dimensioned analytic functions,
scale-space computations, et cetera, suggest that we have captured, in
our surface building process, a powerful and general tool. We
will be developing it further in these and other directions, and
expect we will find similar variety in application for its higher
dimensioned counterparts.
Abstract. Computer models of objects are necessary
for  a  variety  of  tasks  including  model-based  vision. 
Techniques  are  presented  to  construct  face-edge-vertex 
models  directly  from  intensity  images.  A  correspond¬ 
ing  technique  is  developed  to  build  face-edge-vertex 
models  from  range  images.  The  results  are  compared. 


1.  Introduction 

With the recent increased interest in model-based
vision6,8,9,13 has appeared a concomitant interest in
automated model creation.1,7,10 There is also considerable
interest in acquiring models to be used for mission
rehearsal, robotics, and simulation purposes. This paper
describes automatic procedures for building
face-edge-vertex models directly from image data. Techniques are
shown to be applicable and effective both to intensity and
to range data.

Computer  models  must  be  geometrically  accurate. 
Relative  positions  of  surfaces  on  the  model  should  not  be 
very  different  from  those  of  the  actual  object.  The  models 
must  also  be  topologically  sound,  representing  the  object 
by  a  single  orientable  surface.  If  the  models  obtained  from 
individual  views  are  indeed  sound  and  the  view  models 
may  be  placed  in  a  common  coordinate  system,  a  boolean 
intersection  of  the  view  models  provides  an  integrated 
multiple-view  model.  This  is  the  essential  technique  for 
building  models  from  both  range  and  intensity  image  data. 

2.  Models  from  Intensity  Data 
2.1  Overview  of  Technique 

Face-Edge-Vertex models are here derived directly
from one or more ordinary intensity images. Objects or
the individual faces of the object are represented by
(closed) boundaries or 1-cycles in the viewplane. Edges
are detected in the images with the Canny2 edge detector
coupled to an adaptive threshold zero-crossing following
procedure that is effective in actually finding these
1-cycles. Large dangling edge chains are connected to
nearby edges when they are quite close. Edge lines are fit
to approximate the contour followed by the 1-cycles.

The parameters of the imaging camera are used to
generate bounding volumes for the surfaces which gave
rise to the features in each image. Assuming the pinhole
model, these properties are the orientation, focal point
position, and focal length of the camera. Using this
information, the 1-cycles comprising the features in the image can
be used to form volumes (2-cycles or Blocks).

View  volumes  are  intersected  with  volumes 
obtained  from  other  views  to  obtain  the  potentially  solid 
blocks  corresponding  to  objects  in  the  scene.  This  inter¬ 
section  process  usually  produces  an  ambiguous  set  of 
blocks.  Each  block  represents  a  bound  on  the  position  of 
any  surface  which  might  lie  within  that  volume.  By  gen¬ 
erating  repeated  views,  view  solids,  and  intersections,  solid 
models  may  be  generated  which  closely  approximate  the 
object  being  viewed. 

2.2  Relation  to  Previous  Research 

The approach described here is related to the work
of Chien and Aggarwal.3 The chief difference, however, is
that the method described here generates Face-Edge-Vertex
models from multiple edge cycles in one or more intensity
images. Chien and Aggarwal require that an image be
segmented into object and background. This produces
silhouettes which are first converted into quadtrees
representing orthogonal views and then assimilated into
octrees. Chien and Aggarwal suggest that, in principle,
their algorithm extends to non-orthogonal views. While
the octree may be completed from non-orthogonal views,
the procedure is more complicated, involving ray-tracing
techniques.4 Chien and Aggarwal point out that a large
class of objects are well-modelled by the process. In
particular, objects with curved surfaces are represented by a
faceted polyhedral model.

One  disadvantage  of  the  octree  model,  however,  is 
that  any  planar  surface  which  is  not  parallel  to  one  of  the 
octree  axes  will  also  be  broken  up  into  many  orthogonal 
facets  -  all  lying  within  a  small  region  of  the  actual  sur¬ 
face  plane.  The  Face-Edge-Vertex  model  used  here  shows 
no a priori preference for any view direction. The algorithm
is the same whether views are orthogonal or occur at
random  known  angles.  Visible  planar  surfaces  will  be 
resolved  into  either  one  planar  face  or  a  small  number  of 
planar  surfaces  that  are  nearly  coplanar. 

In  the  method  described  here,  multiple  volumes 
are  created  to  correspond  to  the  multiple  edge  cycles  in  the 
image.  These  volumes  must  then  be  disambiguated  into 
solid  or  empty  volumes.  Whether  the  volumes  are  solid  or 
empty  can  often  be  determined  from  multiple  views  of  the 
same  scene  from  different  directions.  There  are  other  tech¬ 
niques  that  may  be  useful,  as  will  be  seen  later. 

This  work  is  also  somewhat  related  to  the  work  of 
Baker,1  and  especially  that  of  Wesley  and  Markowsky.12 
The  Markowsky  and  Wesley  projection  fleshing  algorithm 
solves  a  similar  model  construction  problem.  The 
Wesley-Markowsky  algorithm  is  designed  to  produce  solid 
models  from  wire  frame  projections.  The  solid  model  is 
constructed  by  relying  on  an  existing  wire  frame  fleshing 
algorithm.9  Unfortunately,  this  algorithm  will  succeed  only 
for  a  complete  wire  frame.  The  algorithm  is  well-suited  to 
multiple  orthographic  views  of  wire  frames,  with  no  hidden 
line  removal  allowed.  When  these  conditions  are  not  met, 
significantly  more  views  will  be  necessary  to  produce  the 
vertex  correspondences  required  to  complete  the  wire¬ 
frame.  While  such  conditions  are  relatively  easy  to 
achieve  in  a  CAD-CAM  environment  (the  intended  arena 
for  the  algorithm),  a  different  approach  must  be  used  when 
dealing  with  image  data.  There  are  several  other  problems 
in  applying  the  Markowsky-Wesley  algorithm  to  the  con¬ 
struction  of  solid  models  from  intensity  images.  In  the 
projection  fleshing  algorithm,  the  geometry  of  the  input 
data  is  assumed  to  be  exact.  That  is,  the  correspondence 
between  vertices  from  view  to  view  must  be  exact.  This 
will  almost  never  be  the  case  for  actual  image  data.  Also, 
Markowsky  and  Wesley  discuss  the  topological  aspects  of 
their  algorithm  as  it  relates  to  curved  surfaces,  but  do  not 
provide  any  comparable  algorithm  which  will  behave  suit¬ 
ably  with  curved  surfaces.  The  algorithm  described  in  this 
paper  produces  a  solid  polyhedral  model  from  each  inten¬ 
sity  image,  even  if  the  object  has  curved  surfaces. 
Approximating  curved  edges  in  the  image  by  line  segments 
will  merely  result  in  an  approximate  model  —  rather  than 
no model at all. We show that repeated views of an object
iteratively produce a faceted model approximating a curved
surface.

2.3  Generation  of  View  Volumes 

Volumes are generated using the following procedure:
First, an image is Gaussian smoothed and processed
with a suitable edge detector (e.g. Canny2). Then,
the edges are converted to 1-chains (sets of edges) by
selecting  junctions  and  points  of  high  curvature  (when  the 
edges  are  viewed  as  curves)  to  serve  as  vertices.  It  is  often 
the  case  that  the  result  of  applying  an  edge  detector  to  an 
intensity  image  produces  gaps  in  what  would  otherwise  be 
closed curves consisting of edges. The Canny process has
been modified to attempt to follow weaker zero-crossings
in the vicinity of several edge events. Gaps are produced
by regions of low contrast, artifacts of lighting and
imaging. To counteract this effect, the edges are extended a
certain small length so that they may intersect other,
nearby edges. The resulting complex of edges is now
pruned so that only 1-cycles remain. Each 1-cycle is then
extruded to form a 2-cycle. Each edge of the 1-cycle is
used  to  form  a  back-projected  face.  The  front  and  rear 
faces  of  the  2-cycle  are  simply  planes  set  at  arbitrary  for¬ 
ward  and  rear  distances  from  the  camera.  The  measured 
camera parameters (e.g., focal length) are used to determine
the shape of the 2-cycle by using the inverse perspective
function. The perspective function maps points in
$E^3$, $(x, y, z)$, to points on the viewplane $(X, Y)$ as follows:

$$X = \frac{f x}{z}, \qquad Y = \frac{f y}{z}$$

where $f$ is the camera focal length. For a fixed viewplane
point, differentiating $x$ and $y$ with respect to $z$ yields:

$$\frac{dx}{dz} = \frac{X}{f}, \qquad \frac{dy}{dz} = \frac{Y}{f}$$

This, along with the location of the point in the viewplane,
completely specifies the ray on which the actual point in $E^3$
corresponding to a viewplane point must lie. Two values
of $z$ are chosen ($z_0$, $z_1$) so that they fully enclose the scene
volume  in  the  z  direction.  These  z  values  fix  the  positions 
of  the  front  and  rear  faces  of  the  2-cycles.  The  2-cycles 
are  finally  transformed  into  a  common  coordinate  system 
and  intersected  with  each  other  to  get  approximations  to 
the  objects  found  in  the  scene. 
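The extrusion of a 1-cycle into a 2-cycle can be sketched in a few lines. This is a minimal illustration, assuming the perspective map X = f x / z, Y = f y / z with the centre of projection at the origin, so that the inverse along a ray is x = X z / f, y = Y z / f. The function names and the polygon used in the example are hypothetical, and the transformation into a common coordinate system is left out.

```python
def back_project_cycle(cycle, f, z0, z1):
    """Extrude a viewplane polygon into a truncated view pyramid.

    cycle: list of (X, Y) viewplane vertices of one 1-cycle.
    f: focal length; z0, z1: depths of the front and rear faces.
    Returns (front, rear), two lists of (x, y, z) vertices in
    camera coordinates, in the same order as the input cycle.
    """
    front = [(X * z0 / f, Y * z0 / f, z0) for X, Y in cycle]
    rear = [(X * z1 / f, Y * z1 / f, z1) for X, Y in cycle]
    return front, rear


def side_faces(front, rear):
    """One quadrilateral back-projected face per edge of the 1-cycle."""
    n = len(front)
    return [
        (front[i], front[(i + 1) % n], rear[(i + 1) % n], rear[i])
        for i in range(n)
    ]


# Hypothetical triangular 1-cycle, focal length 2, scene between z = 4 and z = 10.
front, rear = back_project_cycle([(1, 0), (0, 1), (-1, 0)], f=2.0, z0=4.0, z1=10.0)
faces = side_faces(front, rear)
```

Every vertex of the front and rear faces projects back onto the original viewplane vertex, which is what makes the block a bound on any surface that produced the cycle.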

2.4  Experimental  Results 

Figure 1 shows one intensity view used to create
a  model  of  a  jeep.  Image  edges  automatically  extracted  by 
the  modified  Canny  procedure  are  back  projected  to  form 
the  view  solid  of  the  bottom  row.  Figure  2  illustrates  the 
second  view.  The  view  solids  are  intersected  and  the  result 
is  shown  as  a  rendered  image  in  figure  3.  The  model  is 
deceptively  simple,  since  it  was  generated  from  two  views 
90  degrees  apart.  Note  that  each  wheel  and  the  inside  of 
the  jeep  are  separate  bounding  volumes  whose  states 
(empty  or  full)  must  be  determined.  Here  the  blocks  were 
disambiguated  by  the  operator. 

3.  Models  from  Range  Data 
3.1  Previous  techniques 

Previous  work10  has  demonstrated  the  feasibility 
of  using  three  dimensional  edges  developed  from  multiple 
range  views  to  produce  wire-frame  representations  for  an 
object.  The  Wesley-Markowsky  wire  frame  fleshing  algo¬ 
rithm9  becomes  directly  applicable  to  such  a  wire  frame. 
The  separate  view  solids  thus  produced  are  transformed 
into  a  common  coordinate  system  and  intersected. 
Figure 1. First view of a Jeep. The intensity image and the
view  solid  automatically  obtained. 


Figure  2.  Second  view  of  a  jeep.  Intensity  image  and  view 
solid. 


There  are  several  problems  that  make  this  pro¬ 
cedure  difficult  to  apply.  The  edges  that  bound  a  visible 
face  are  connected  in  the  range  image.  The  casting  of  each 
intersection  vertex  into  three  dimensions  guarantees 
preserving any 1-cycles in three dimensions. Unfortunately,
there can be absolutely no guarantee of the
planarity of such edge cycles. The Markowsky-Wesley
algorithm  identifies  possible  2-cycles.  Usually,  the  solids 
which  are  so  generated  are  ambiguous,  and  a  tree  of  possi¬ 
ble  solid/empty  labels  of  the  volumes  is  the  output  of  the 
algorithm.  While  the  ambiguity  cannot  be  completely 
resolved  when  fleshing  wire-frames  from  engineering  draw¬ 
ings,  the  ambiguity  between  empty  and  solid  is  absent  in 
range  data. 

The  new  procedure  avoids  the  planarity  issue 
entirely  by  extracting  least  mean-squared  error  faces 
directly  from  the  range  image  and  wrapping  auxiliary  faces 
to  close  2-cycles.  In  this  process,  the  range  image  disam¬ 
biguates  space  into  solid  and  empty.  The  model  resulting 
from  the  new  procedure  has  strictly  planar  faces,  and  is  a 
closed  surface. 


Figure  3.  Rendered  jeep  model  obtained  from  the  intersection  of 
the  two  view  solids  derived  from  intensity  images. 


3.2  The  Algorithm 

Edge  cycles  corresponding  to  significant  object 
features  are  formed  in  range  images.  Edges  are  of  two 
types,  curvature  and  occlusion.  Occlusion  edges  are 
represented  by  step  edges  in  a  range  image.  The  edge 
detection  techniques  typically  used  with  intensity  images 
also  provide  very  good  results  with  range  images.  The 
curvature  edges  are  problematic.  Ideally  these  edges 
should  follow  lines  of  high  curvature  representing  the 
boundaries  of  adjacent  object  faces.  In  practice  these 
edges  are  difficult  to  extract.  We  have  experimented  with 
automatic  extraction  of  curvature  edges  for  the  purposes  of 
model  creation11  but  the  results,  to  date,  have  been  disap¬ 
pointing.  Although  we  are  able  to  obtain  reliable  edge 
boundaries  for  the  intensity  images  automatically,  for  the 
purposes  of  this  paper,  edges  in  range  data  are  provided  by 
the  operator.  Work  continues  on  improving  automatic 
techniques  for  extracting  curvature  edges. 

Selected  edges  form  large  complete  cycles 
corresponding  to  major  deviations  from  planarity.  A  single 
image  is  assumed  to  contain  one  or  more  objects  fully  sur¬ 
rounded  by  a  background  surface,  identifiable  by  its  depth. 
The  outer  bounding  cycle  of  that  background  surface  prop¬ 
erly  contains  every  other  viewplane  cycle  and  is  designated 
as  the  rear  face.  Viewplane  cycles  are  converted  to  three- 
dimensional  faces  or  holes  (if  they  are  coplanar  with  the 
rear  face).  Pixels  within  the  current  face  but  not  in  a  hole 
are  sampled.  A  least-squares  surface  fit  determines  the 
geometric  parameters  of  the  three-dimensional  face  within 
the  outer  edges.  Sensor  geometry  provides  placement  of 
the  bounding  edges  to  a  plane  in  space.  The  intersection 
of  the  face  equation  and  the  plane  determined  by  the  edge 
provides a very accurate placement of the edge in three-
dimensional space.
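The least-squares face fit can be illustrated with a small self-contained routine. This is only a sketch under stated assumptions: it fits z = a x + b y + c by minimising the squared depth residual, which is one common choice and not necessarily the error measure used in the procedure above, and it solves the 3 x 3 normal equations by Cramer's rule to avoid external libraries. All names are hypothetical.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def fit_plane(samples):
    """Least-squares fit of z = a*x + b*y + c to (x, y, z) range samples."""
    n = len(samples)
    sx = sum(x for x, _, _ in samples)
    sy = sum(y for _, y, _ in samples)
    sz = sum(z for _, _, z in samples)
    sxx = sum(x * x for x, _, _ in samples)
    syy = sum(y * y for _, y, _ in samples)
    sxy = sum(x * y for x, y, _ in samples)
    sxz = sum(x * z for x, _, z in samples)
    syz = sum(y * z for _, y, z in samples)

    # Normal equations A * (a, b, c)^T = rhs.
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("samples are degenerate (collinear or too few)")

    coeffs = []
    for col in range(3):
        m = [row[:] for row in A]
        for r in range(3):
            m[r][col] = rhs[r]  # Cramer's rule: swap in the right-hand side
        coeffs.append(det3(m) / d)
    return tuple(coeffs)


# Hypothetical pixels sampled from the plane z = 2x - 3y + 5.
a, b, c = fit_plane([(x, y, 2 * x - 3 * y + 5) for x in range(3) for y in range(3)])
```

In practice the samples would be the range pixels inside the current face but outside its holes, and a fit that minimises orthogonal distance may be preferable for steeply inclined faces.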

The  "holes"  in  the  outer  bounding  cycle  (or  rear 
face)  are  precisely  the  occlusion  edges  of  the  sensed 
objects.  These  may  be  outermost  edges  or  edges  surround¬ 
ing  holes  in  the  sensed  objects.  To  produce  a  model  with 
holes  that  properly  correspond  to  those  of  the  sensed 
object,  the  cycles  surrounding  holes  in  the  sensed  object 
must  be  identified.  The  faces  that  correspond  to  the  cycles 
that  bound  object  holes  are  identified  and  collected.  At 
this  point  all  of  the  three-dimensional  faces  that  correspond 
to  viewplane  cycles  are  fully  formed.  It  remains  only  to 
wrap  back-projected  faces  around  the  object  and  to  close 
the  rear  surface. 

The  first  step  in  the  back  projection  considers  each 
vertex  in  the  range  image  cycle  complex.  A  line  is  formed 
in  the  view  direction  that  includes  the  vertex.  Cycle,  edge, 
and vertex links permit the identification of all the range-
image  cycles  that  include  the  range  vertex  in  question. 
Thus,  the  three-dimensional  faces  that  are  encountered 
along  the  view  direction  from  the  vertex  may  be  readily 
identified.  The  intersections  of  this  view  direction  line  and 
the  identified  faces  are  the  vertices  that  will  correspond  to 
the  range  vertex.  Each  three-dimensional  vertex  is  also 
paired and linked with the appropriate face. These
corresponding three-dimensional vertices are connected
pairwise, with three-dimensional edges.
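The core geometric operation in this step is the intersection of a view-direction line with the plane of a face. The sketch below assumes each face plane is written as normal . x = offset and that the view direction is supplied explicitly; the function name and the example planes are hypothetical.

```python
def intersect_line_plane(point, direction, normal, offset):
    """Intersect the line point + t * direction with the plane normal . x = offset.

    Returns the (x, y, z) intersection, or None when the line is
    parallel to the plane.
    """
    denom = sum(n * d for n, d in zip(normal, direction))
    if abs(denom) < 1e-12:
        return None
    t = (offset - sum(n * p for n, p in zip(normal, point))) / denom
    return tuple(p + t * d for p, d in zip(point, direction))


# Hypothetical range vertex at (1, 2) viewed along the z axis,
# lying on the boundary between the faces z = 3 and x + z = 10.
view_line = ((1, 2, 0), (0, 0, 1))
front_vertex = intersect_line_plane(*view_line, normal=(0, 0, 1), offset=3)
rear_vertex = intersect_line_plane(*view_line, normal=(1, 0, 1), offset=10)
```

The two results are the three-dimensional vertices that correspond to the single range vertex; joining them gives the edge that runs back along the line of sight.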

The  back  projections  are  done  on  an  edge  by  edge 
basis.  Each  range  edge  is  in  precisely  two  viewplane 
cycles  and  thus  represents  the  boundary  of  exactly  two 
three-dimensional  faces.  Ordinarily  the  same  face  is  front- 
most  at  each  end  of  the  three-dimensional  edge.  In  this 
case,  forming  the  back  projection  is  quite  easy.  The  edge 
in  the  viewplane  together  with  the  view  direction  determine 
a  plane  which  may  be  intersected  with  the  forward  and 
rearward  faces,  producing  two  edges.  There  are  four  edges 
for  each  back-projected  face:  these  two  three-dimensional 
edges  corresponding  to  the  range  image  edge,  and  two 
additional  edges  extending  back  along  the  line  of  sight, 
created  in  the  first  step  of  this  section.  There  is  another 
case  for  back-projected  faces.  The  same  face  occasionally 
will  not  be  front-most  at  both  ends  of  a  range  edge.  When 
it  is  not,  the  back  projected  surface  must  be  split  into  two 
triangles  meeting  at  a  common  three-dimensional  vertex. 
The  three-dimensional  vertices  in  the  front  and  rear  face 
for  each  endpoint  of  the  range  edge,  together  with  this 
common vertex, form the three corners of each triangular
back-projected  face. 

Whether  by  the  simple  case  or  by  the  two  triangle 
case,  every  back  projected  face  has  now  been  constructed. 
The  final  step  constructs  a  valid  rear  face  for  the  object 
model.  Recall  that  an  outer  bounding  rear  face  was  created 
early  in  the  process.  During  the  rest  of  the  procedure, 
faces  may  have  been  created  which  were  coplanar  with  the 
rear  face.  These  rear  face  cycles  are  arranged  in  a  contain¬ 
ment  tree.  The  outer  rear  cycle,  at  the  top  level,  is  made 
up  of  the  outer-most  occlusion  edges  of  the  object  and  is 
labeled  solid.  Each  interior  cycle/face  is  labelled  in  the 
opposite  sense  of  its  ancestor.  The  resulting  tree  will 
represent  the  final  face  for  the  model. 
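The alternating labelling of the containment tree can be sketched as follows. The sketch assumes that "ancestor" means the immediately enclosing cycle, and that the tree is stored as a dictionary from a cycle id to the cycles directly inside it; both the representation and the names are hypothetical.

```python
def label_cycles(tree, root):
    """Label nested rear-face cycles alternately 'solid' and 'empty'.

    tree: dict mapping a cycle id to the list of cycles directly inside it.
    root: id of the outer rear cycle, which is labelled solid.
    """
    labels = {}
    stack = [(root, "solid")]
    while stack:
        node, label = stack.pop()
        labels[node] = label
        # Each enclosed cycle takes the opposite label of its parent.
        child_label = "empty" if label == "solid" else "solid"
        for child in tree.get(node, []):
            stack.append((child, child_label))
    return labels


# Hypothetical nesting: an object outline containing a hole,
# which in turn contains an island of material.
labels = label_cycles({"outline": ["hole"], "hole": ["island"]}, "outline")
```

Here the outline and the island come out solid and the hole empty, so the label depends only on the parity of the nesting depth.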

3.3  Experimental  Results 

Polyhedral  models  from  multiple  views  are  inter¬ 
sected  to  refine  the  model.  Figure  4  shows  our  range  cam¬ 
era.  The  mirrors  at  the  left  side  direct  a  stripe  of  light  on 
an  object  as  shown  in  figure  5.  Figure  6  shows  a  range 
image  for  a  single  view  of  a  jeep,  the  edges  identified  in 
the  view,  and  a  view  solid  produced  by  the  algorithm. 
Figure  7  shows  a  second  view  and  associated  edges.  Fig¬ 
ure  8  shows  the  intersection  of  the  two  view  solids.  It  is 
apparent  that  the  intersection  result  closely  approximates 
the  original  object. 

4.  Conclusions 

The  practical  use  of  geometric  and  topological 
properties  of  image  data  to  construct  three-dimensional 
models  has  been  demonstrated.  The  essential  information 
consists  both  of  local,  geometric  information  obtained  from 
step  and  curvature  edges  found  in  the  image,  and  the  more 
global,  topological  information  inherent  in  the  data  (the 
cycles  which  define  surface  boundaries).  Preserving  this 
information at every step of the model creation process,
whether for intensity or range data, results in geometrically
and  topologically  sound  volumetric  models.  Single  view 
models  may  be  integrated  by  repeated  intersection  to 
achieve  an  accurate  multiple  view  model. 

For comparison, the solid models created
from intensity images and range images are shown side by side
in figure 9 (the intensity model is on the left, the range
model on the right). Both models represent the jeep
object well. Similar accuracy should be available from both
techniques  with  proper  care  in  calibration.  It  is  expected 
that  the  range  image  technique  can  provide  extra  detail  in 
concavities  not  readily  apparent  or  disambiguated  by  the 
intensity  technique. 
University of Massachusetts
Amherst,  Massachusetts 


ABSTRACT 

A  technique  for  computing  distance  to  environmental  sur¬ 
faces  suitable  for  obstacle  recognition  by  a  mobile  robot  is  pre¬ 
sented.  Accurate  knowledge  of  the  camera  motion  parameters  is 
not required. We demonstrate how motion in depth manifests
itself in the projected lengths and areas of environmental surfaces
whose  extent  in  depth  is  small  relative  to  their  distance  from 
the  camera.  Results  on  a  sequence  taken  by  a  mobile  robot  are 
presented  to  demonstrate  the  accuracy  of  the  method.  Finally, 
we  present  examples  of  organizational  principles  which  would 
allow  suitable  environmental  structures  to  be  recognized  auto¬ 
matically. 

1.  INTRODUCTION 

Perspective geometry tells us that the optical flow which
confronts an organism moving through its environment is purely a
function of the motion of the organism and the distance to
surfaces in the environment. In principle, precise knowledge of the
nature of the motion would allow its effects to be subtracted,
and the distance to environmental surfaces to be computed. In
this way, the organism establishes a relationship with its
environment [7]. Conveniently, the motion of the organism itself, or the
egomotion, can, in principle, be computed from the optical flow.
The problem of computing the parameters of the egomotion is
vastly simplified when it is known, a priori, that the motion is
purely translational. For the case of pure translational motion,
knowledge of the position of the focus of expansion, or FOE,
provides two of three translation parameters, the third being
velocity in the direction of gaze. There have been several attempts
to use this assumption for computing distance to environmental
surfaces [9,2]. Unfortunately, although this assumption is sound
in theory, even a very small deviation from pure translational
motion (in the form of rotation) results in significant errors in
the determination of the position of the FOE (see [6] in these
proceedings). This renders techniques which rely on accurate
knowledge of the position of the FOE effectively useless.

Although there has been some success in computing distance
to environmental surfaces assuming completely general motion
[1], depth values computed in this manner are, likewise, only as
accurate as the estimates of the egomotion parameters. The goal
of computing distance to environmental surfaces without full and
accurate knowledge of the egomotion parameters remains attractive.

In this paper, we derive equations illustrating how motion
in  depth  manifests  itself  in  the  projected  lengths  and  areas  of 
environmental  surfaces  whose  extent  in  depth  is  small  relative 
to  their  distance  from  the  camera.  We  then  present  results  of 
an  experiment  with  an  image  sequence  from  the  mobile  robot 
domain  to  demonstrate  the  potential  accuracy  of  the  method. 
Finally,  we  present  examples  of  organizational  principles  which 
would  allow  suitable  environmental  structures  to  be  recognized 
automatically. 

2.  DEPTH  FROM  LOOMING  STRUCTURE 

Perspective projection can be approximated by a scaled
orthographic projection when two conditions are met [11]. First,
the  depth  to  the  centroid  of  the  environmental  structure  in  ques¬ 
tion  must  be  large  with  respect  to  the  focal  length  of  the  camera. 
Second,  the  total  extent  in  depth  of  the  structure  must  be  small 
compared  to  the  depth  of  its  centroid.  We  call  an  environmental 
structure  satisfying  these  two  requirements  a  shallow  structure. 
We assume that for shallow structures, scaled orthographic
projection and perspective projection are equivalent. Assuming that
environmental structure, of length $L$, satisfies the shallow
structure requirement and lies at a distance, $z$, from the image plane,
then its projected length, $l_0$, will be

$$l_0 = \frac{f L}{z} \qquad (1)$$

where $f$ is the focal length of the camera.

If the imaging device is translating into the environment with
velocity, $\vec{T}$, then the component of the velocity in the direction
of gaze, $T_z$, is

$$T_z = \vec{T} \cdot \hat{z} = |\vec{T}| \cos\theta \qquad (2)$$

where $\theta$ is the angle between the direction of gaze and the FOE.
Thus, knowledge of the position of the FOE is required only to
compute the component of $\vec{T}$ in the direction of gaze. Since $T_z$
is proportional to $\cos\theta$, and since $\cos\theta$ is essentially equal to
one when the FOE and the direction of gaze are close (which is
normally the case), errors of several degrees in the determination
of the position of the FOE can be tolerated. After time, $t$, the
projected length, $l_1$, will be

$$l_1 = \frac{f L}{z - T_z t}$$

From this, we see that the ratio, $l_0 / l_1$, is

$$\frac{l_0}{l_1} = \frac{z - T_z t}{z}$$

Solving for $z$, we get

$$z = T_z t \, \frac{l_1}{l_1 - l_0}$$

This is essentially the time adjacency relationship,

$$\frac{z}{T_z t} = \frac{l_1}{l_1 - l_0}$$

where $l_0$ and $l_1$ are not the distances from the FOE of a point at
two different times, but are rather the lengths of the projection
of some environmental structure. We can thus view the time
adjacency relationship of [9,2] as a special case of the looming
structure  relationship;  the  FOE  and  the  point  in  motion  define 
the  projection  of  an  imaginary  line  segment. 

We can generalize the looming structure relationship for
projected lengths to projected areas. Assuming that environmental
structure, with area $A$, satisfies the shallow structure
requirement and lies at distance, $z$, from the image plane, then its
projected area, $a_0$, will be

$$a_0 = \frac{f^2 A}{z^2}$$

After time, $t$, the projected area will be

$$a_1 = \frac{f^2 A}{(z - T_z t)^2}$$

As with lengths, we compute the ratio, and see that it is

$$\frac{a_0}{a_1} = \frac{(z - T_z t)^2}{z^2}$$

Solving for $z$, we get

$$z = T_z t \, \frac{\sqrt{a_1}}{\sqrt{a_1} - \sqrt{a_0}}$$
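The looming relationships for lengths and areas reduce to one small computation. The sketch below assumes the scaled orthographic model of this section, in which a projected length is l = f L / z and a projected area is a = f^2 A / z^2, so that z = T_z t l_1 / (l_1 - l_0) with the square root of the area standing in for length. The function names and the synthetic measurements are hypothetical and are not taken from the experiments.

```python
import math


def depth_from_lengths(l0, l1, tz, t=1.0):
    """Depth z at the time l0 was measured, from z = Tz * t * l1 / (l1 - l0).

    l0, l1: projected lengths before and after the motion.
    tz: speed along the direction of gaze; t: elapsed time.
    """
    if l1 == l0:
        raise ValueError("no change in projected length; depth is undetermined")
    return tz * t * l1 / (l1 - l0)


def depth_from_areas(a0, a1, tz, t=1.0):
    """Same relationship, with sqrt(area) playing the role of length."""
    return depth_from_lengths(math.sqrt(a0), math.sqrt(a1), tz, t)


# Synthetic check: a structure of length 2 at depth 20, camera advancing 1.91.
f, L, z, step = 1.0, 2.0, 20.0, 1.91
l0, l1 = f * L / z, f * L / (z - step)
recovered = depth_from_lengths(l0, l1, step)
```

Because the focal length and the true size cancel in the ratio, neither needs to be known; only the two image measurements and the advance along the direction of gaze enter the estimate.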


3.  EXPERIMENTAL  RESULTS 

In order to test this idea on a real image sequence we
employed the line matching system developed by Williams and
described elsewhere in these proceedings [12] to match line
segments computed with Boldt's algorithm [3]. The sequence was
taken with a camera mounted on a mobile robot and has a
rotational component large enough to frustrate an algorithm
dependent on accurate knowledge of the position of the FOE
(Figure 1). The line matching results are satisfactory, although the
lengths of the line segments produced by Boldt's algorithm (or
any grouping algorithm) are often unreliable. Fortunately, the
orientation and lateral placement of the lines is accurate, and
we exploited this fact to define a virtual line whose length could
be accurately measured over the course of the motion sequence.
The endpoints of the virtual line are defined by the intersections
of two pairs of line segments. Knowledge of the correspondence
through time of the line segments defining the virtual line
provides information about the changing parameters of the virtual
line itself. Just as a virtual line can be defined by two pairs of
physical lines, a virtual region can be defined with three or more
pairs (Figure 2). For this experiment, virtual lines and regions
satisfying the shallow structure requirement were defined
manually, through a graphic user interface. Although the virtual lines
and regions used in this experiment are either intensity
discontinuities or are bounded by them, this is not a requirement. In
the next section, we present a set of organizational rules which
would allow a large number of structures satisfying the
shallow structure requirement, both virtual and non-virtual, to be
recognized automatically. The virtual lines and regions used
appear, with labels, in Figures 3 and 4. To increase accuracy,
each depth value was computed over the largest interval that all
line segments defining the virtual line or region were tracked,
that is to say, until one line exited the image or failed to have
an acceptable match.

Figure 1. The first frame of a motion sequence taken by a
mobile robot moving down a hallway.

Knowledge  of  the  position  of  the  FOE  improves  the  accuracy 
of  depth  estimates  but  is  not  critical.  For  this  experiment,  the 
position  of  the  FOE  was  estimated  by  hand,  although  algorithms 
exist which are at least as accurate [9]. The robot moved a
distance of 1.95 feet between frames. Knowing the position of
the FOE allowed us to estimate $T_z$, the component of the robot's
motion in the direction of gaze, as 1.91 feet. This results in less
than  a  3%  increase  in  accuracy  over  simply  assuming  that  the 
FOE  and  the  direction  of  gaze  are  identical.  The  depths  to  the 
virtual  lines  are  shown  in  Table  1,  along  with  the  ground  truths, 
percent  errors  and  the  number  of  frames  contributing  to  the 
depth  estimate.  Table  2  displays  the  same  information  for  the 
virtual  regions. 
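The two figures quoted above can be related through the relationship T_z = |T| cos(theta). The short sketch below, with hypothetical function and variable names, recovers the gaze-to-FOE angle implied by a 1.95 foot step and a 1.91 foot component along the direction of gaze, together with the size of the correction that is lost if the two directions are simply assumed to coincide.

```python
import math


def gaze_correction(step, tz):
    """Return (theta_degrees, percent_correction) for a step and its gaze component.

    step: distance moved between frames, |T| * t.
    tz:   component of that motion along the direction of gaze, Tz * t.
    """
    theta = math.degrees(math.acos(tz / step))  # implied gaze-to-FOE angle
    percent = (step - tz) / step * 100.0        # effect of assuming cos(theta) = 1
    return theta, percent


# Values taken from the text: 1.95 ft per frame, of which 1.91 ft is along the gaze.
theta, percent = gaze_correction(1.95, 1.91)
print(f"implied angle: {theta:.1f} degrees, correction: {percent:.2f}%")
```

Since depth scales linearly with T_z in the looming relationship, the same percentage carries over directly to the depth estimates, which is consistent with the statement that the FOE position is helpful but not critical.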
Figure 3. The line segments used to define virtual lines.
Table 2.

Virtual Region   Depth (ft.)   Ground Truth (ft.)   % Error   Frames
Cone 1           20.1          20.0                 0.5       [illegible]
Cone 2           25.8          25.0                 3.2       3
Cone 3           35.5          35.0                 1.4       1
Cone 4           40.0          40.0                 0.0       [illegible]
4. RECOGNIZING SHALLOW STRUCTURES

If there were no way to discriminate environmental
structure satisfying the shallow structure requirement from structure
which doesn't, then the depth from looming method would be
of limited use. Fortunately, this is not the case. Organizational
rules exist which would allow a perceptual organization process
to automatically recognize a large class of shallow structures.
All of the organizational rules which we employ are derived from
two simple observations. First, certain organizations of primitive
structural descriptors are unlikely to occur through pure chance.
This is the non-accidentalness principle described by Lowe [10].
Second, since we defined shallow structures to be those
structures whose perspective projections are essentially equal to
scaled orthographic projections, we can use lack of distortion due
to perspective as a means of recognizing them.
Figure 4. The line segments used to define virtual regions.


Spacing.  Collections  of  equally  spaced  point  tokens  or  par¬ 
allel  lines  are  highly  unlikely  to  occur  through  pure  chance. 
When we detect an arrangement of equally spaced point tokens
or parallel lines in the image we can conclude not only that the
corresponding environmental structures are also equally spaced
[10], but also that the extent in depth of the entire collection
is  small  compared  to  its  distance  from  the  camera.  If  this  were 
not  the  case  perspective  effects  would  destroy  the  equal  spacing 
(Figure  5). 
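The effect of perspective on equal spacing is easy to check numerically. The sketch below assumes a one-dimensional pinhole projection X = f x / z and compares the largest and smallest gaps between projected points; the two scenes are hypothetical and chosen only to contrast a shallow arrangement with one that recedes in depth.

```python
def project(points, f=1.0):
    """Perspective projection X = f * x / z of (x, z) points onto the image line."""
    return [f * x / z for x, z in points]


def spacing_ratio(points, f=1.0):
    """Ratio of the largest to the smallest gap between consecutive projections.

    A ratio near 1 means the image spacing is (almost) equal.
    """
    xs = project(points, f)
    gaps = [abs(b - a) for a, b in zip(xs, xs[1:])]
    return max(gaps) / min(gaps)


# Five tokens one unit apart, nearly fronto-parallel at a depth of about 100.
shallow = [(float(i), 100.0 + 0.01 * i) for i in range(5)]
# Five tokens two units apart in depth, receding from z = 10 to z = 18.
deep = [(1.0, 10.0 + 2.0 * i) for i in range(5)]

shallow_ratio = spacing_ratio(shallow)
deep_ratio = spacing_ratio(deep)
```

For the shallow arrangement the gaps are equal to within a small fraction of a percent, while for the receding row the nearest gap is more than twice the farthest, so equal image spacing is good evidence of small extent in depth.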


Figure 5. From the spacing of the windows on the building
in  the  background,  and  of  the  steps  in  the  foreground,  we  can 
infer  that  the  distance  to  each  structure  is  large  relative  to  its 
extent  in  depth. 

Parallelness. Barring accidental arrangement, parallel lines
in the image are parallel in the environment [8,10]. Surprisingly,
we can also infer that each line individually is a shallow structure,
otherwise the lines would converge due to perspective. Thompson
and Mundy [11] cite this as an invariant property of scaled
orthographic projections (Figure 6).

Orthogonality. Orthogonality is a compelling non-accidental
relationship, especially in manmade environments. Perpendicular
lines in the environment will appear perpendicular in the image
only if they both lie in a plane satisfying the shallow structure
requirement (Figure 7).

Symmetry. We were tempted to call this rule "non-skewed
symmetry" because it represents the degenerate case of Kanade
and Kender's skew symmetry method [8]. Simply stated,
symmetric configurations in the image remain symmetric only if they
satisfy the shallow structure requirement (Figure 8).

Some work has already been conducted at the University of
Massachusetts which can be directly applied to the problem of
finding shallow structures. Reynolds and Beveridge describe a
method for detecting parallel and orthogonal structures in aerial
images in [?]. Significantly, any line segment which is a member
of some equivalence class of parallel lines or lies perpendicular
to a line in such a class can be classified as a shallow structure.
We also hope to employ a generic graph matcher implemented
by Brolio and briefly described in [?] to recognize three line
configurations exhibiting a simple bilateral symmetry.


Figure 6. An equivalence class of parallel lines, each of which
can be considered a shallow structure.


Figure  7.  Perpendicular  lines  appear  perpendicular  only  if 
they  satisfy  the  shallow  structure  requirement. 


Figure 8. The traffic cones, trash can and doorway exhibit
a simple bilateral symmetry that has not been distorted by the
effects of perspective. We can conclude that the center line of
each configuration is a shallow structure.
5. CONCLUSION

We  have  demonstrated  that  in  certain  cases,  depth  to  en¬ 
vironmental  surfaces  can  be  computed  from  a  motion  sequence 
without  first  completely  determining  the  egomotion  parameters. 
Depth  information  manifests  itself  not  only  in  image  plane  ve¬ 
locities,  but  also  in  the  changing  lengths  and  areas  of  structural 
descriptors. A simple formulation for the special case of
environmental structure whose extent in depth is small compared to
its distance from the camera has been derived. The potential
accuracy and utility of the method have been demonstrated in
an experiment with an image sequence from the mobile robot
domain. Although the straight line configurations employed
in  these  experiments  were  defined  manually,  work  is  underway 
on  the  problem  of  automatically  extracting  these  configurations 
from  images  based  on  general  organizational  principles  including 
spacing,  parallelness,  orthogonality  and  symmetry. 
Abstract

Superellipsoids  are  parameterized  solids  which  can  appear  like 
cubes  or  spheres  or  octahedrons  or  8-pointed  stars  or  anything  in 
between.  They  can  also  be  stretched,  bent,  tapered  and  combined 
with boolean operations to model a wide range of objects. Columbia’s vision
group  is  interested  in  using  superquadrics  as  model  primitives  for 
computer  vision  applications  because  they  are  flexible  enough  to 
allow  modeling  of  many  objects,  yet  they  can  be  described  by  a  few 
(5-14)  numbers.  This  paper  discusses  research  into  the  recovery 
of  superellipsoids  from  3-D  information,  in  particular  range  data. 

This  research  can  be  divided  into  two  parts,  a  study  of  po¬ 
tential  error-of-fit  measures  for  recovering  superquadrics,  and  im¬ 
plementation  and  experimentation  with  a  system  which  attempts 
to  recover  superellipsoids  by  minimizing  an  error-of-fit  measure. 

This paper presents an overview of work in both areas. Included
are data from an initial comparison of 4 error-of-fit measures
in terms of the inter-relationship between each measure and
the parameters defining the superellipsoid. Also discussed is an
experimental system which employs a nonlinear least square
minimization technique to recover the parameters. This paper discusses
both the advantages of this technique and some of its major
drawbacks. Examples are presented, using both synthetic and range
data, where the system successfully recovers superellipsoids,
including “negative” volumes as would occur if superellipsoids were
used in a constructive solid modeling system.


1  Introduction 

As a modeling primitive, superquadrics provide a natural extension
to traditional CAD models [Barr 81]. Recently they have
become the focus of a few computer vision research efforts, see
[Pentland 86a], [Pentland 87], [Bajcsy and Solina 87b], [Solina 87],
and [Boult and Gross 87b]. Most of the research to date has
concentrated on a subclass of superquadrics called superellipsoids
(hereafter SEs). The growing research interest in SEs can
partially be attributed to the fact that they can model a wide range of
objects in a very compact form. Another contributing factor is the
underlying mathematical formulation which provides convenient
tools for their recovery.

The next section of this paper presents a mathematical definition
of superellipsoids and a little motivation for using SE’s
in computer vision. Section 3 defines various error-of-fit measures
which could be used in the recovery of SE’s, and presents an initial
comparison of these measures.

The current system is described in Section 4. Briefly, it employs
a nonlinear least square minimization technique on the inside-outside
function to recover the parameters. Oddly, this function
is one of the poorer error-of-fit measures considered, yet it is
sufficient to allow for reasonable (qualitative) recovery of SEs from
both synthetic and range data. A few examples of the results of
the  system  are  then  presented.  The  examples  include  the  use  of 
very  sparse  data,  synthetic  multiple  views  and  the  recovery  of  a 
negative  superellipsoid  from  range  data. 

Previous  papers  by  the  authors,  [Boult  and  Gross  87b]  and 
[Boult  and  Gross  87a],  discuss  a  few  other  difficulties  that  are 
encountered  when  modeling  objects  with  superellipsoids,  or  at¬ 
tempting  to  recover  superquadrics  from  3D  data.  These  difficul¬ 
ties  include  the  general  problems  associated  with  non-orthogonal 
representations,  the  difficulties  of  dealing  with  objects  which  are 
not  exactly  representable  with  CSG  operations  on  the  primitives, 
the  need  to  recognize  negative  objects,  and  certain  numerical  in¬ 
stabilities. 

Some pros and cons of the approach, as well as a few conclusions
and a discussion of work in progress at Columbia’s vision lab,
appear in the final section.

The  main  result  in  this  paper  is  that  there  are  problems  and 
biases  with  currently  used  error-of-fit  measures,  as  well  as  with 
two  newly  proposed  measures.  It  also  suggests  that  even  better 
measures  await  definition,  especially  ones  measuring  error  using 
knowledge of the sensor direction. Another important result of
the research is that least square minimization (even using a poor
error-of-fit measure) allows recovery of both positive and negative
instances of superellipsoids from depth data.


2  Background  and  Motivation 


Mathematically, SE solids are a spherical product of two superellipse
curves (see [Gardiner 65]). Superellipse curves are similar
to traditional ellipses except the terms in the definition are
raised to parameterized exponents (not necessarily integers), i.e.:

$$\left(\frac{x}{a}\right)^{2/\epsilon} + \left(\frac{y}{b}\right)^{2/\epsilon} = 1.$$

When $\epsilon$, the relative shape parameter, is 1, the
curve describes an ellipse. As the relative shape parameter varies
from 1 down to 0, the shape becomes progressively squarish; as
it varies from 1 toward 2, the shape transforms from an ellipse to
a diamond shaped bevel. When the parameter is greater than
2, the shape becomes pinched, and as the parameter approaches
infinity, the shape approaches a cross.


Figure 1: Superellipsoids with relative shape parameter $\epsilon_2$ having
values .1, .5, 1, 1.5 and 2 (left to right), and $\epsilon_1$ having values .1,
.6 and 1 (top to bottom).


Figure 2: Examples of SE’s deformed by bending and tapering


The result of the spherical product of two such curves is
conveniently represented in a parametric form, e.g., a superellipsoid
can be represented as:

$$s(\eta,\omega) = \begin{bmatrix} a_x \cos^{\epsilon_1}(\eta)\cos^{\epsilon_2}(\omega) \\ a_y \cos^{\epsilon_1}(\eta)\sin^{\epsilon_2}(\omega) \\ a_z \sin^{\epsilon_1}(\eta) \end{bmatrix}, \quad -\frac{\pi}{2} \le \eta \le \frac{\pi}{2}, \quad -\pi \le \omega < \pi \qquad (1)$$

for any fixed positive $a_x$, $a_y$, $a_z$, $\epsilon_1$, and $\epsilon_2$.

The parameters $a_x$, $a_y$, $a_z$ affect the size of the superellipsoid along
the x, y and z axes (respectively) in the object centered coordinate
system. The parameters $\epsilon_1$ and $\epsilon_2$ affect the relative shape
of the superellipsoid in the latitudinal (xz) and longitudinal (xy)
directions, see figure 1. When the 5 parameters are all unity, the
superellipsoid is the unit sphere.
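
As an illustration of the parametric form in equation 1, the following minimal sketch evaluates a surface point for given angles and parameters. The function names `spow` and `se_point` are hypothetical, and the signed power (sign of the base times the power of its magnitude) is an assumption made here so that fractional exponents of negative cosines and sines remain real; the text itself does not spell out this convention.

```python
import math

def spow(base, exponent):
    """Signed power: sign(base) * |base|**exponent (assumed convention)."""
    return math.copysign(abs(base) ** exponent, base)

def se_point(eta, omega, ax=1.0, ay=1.0, az=1.0, e1=1.0, e2=1.0):
    """Point s(eta, omega) on a superellipsoid, following equation 1."""
    x = ax * spow(math.cos(eta), e1) * spow(math.cos(omega), e2)
    y = ay * spow(math.cos(eta), e1) * spow(math.sin(omega), e2)
    z = az * spow(math.sin(eta), e1)
    return (x, y, z)

# With all five parameters equal to one, the point lies on the unit sphere.
p = se_point(0.3, 1.1)
radius = math.sqrt(sum(c * c for c in p))
```

Small values of `e1` and `e2` push the generated points toward the faces of the bounding box of half-widths `ax`, `ay`, `az`, which is the squarish behavior described for the superellipse curve.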


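Superellipsoids can also be defined implicitly as the volume
where the SE inside-outside function is less than one, where the
SE inside-outside function is given by:

$$f(x,y,z) = \left[\left(\frac{x}{a_x}\right)^{2/\epsilon_2} + \left(\frac{y}{a_y}\right)^{2/\epsilon_2}\right]^{\epsilon_2/\epsilon_1} + \left(\frac{z}{a_z}\right)^{2/\epsilon_1} \qquad (2)$$

where if

$f(x, y, z) = 1$, then x, y, z is on the surface,
$f(x, y, z) < 1$, then x, y, z is inside the volume,
$f(x, y, z) > 1$, then x, y, z is outside the volume.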
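
The inside-outside test can be sketched in a few lines. This is only an illustration under stated assumptions: the names `inside_outside` and `classify` and the tolerance `tol` are hypothetical, and absolute values are taken before the fractional powers so the function stays real in every octant.

```python
def inside_outside(x, y, z, ax, ay, az, e1, e2):
    """f(x, y, z) of equation 2, using absolute values (assumed convention)."""
    xy = (abs(x / ax) ** (2.0 / e2) + abs(y / ay) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / az) ** (2.0 / e1)

def classify(x, y, z, ax, ay, az, e1, e2, tol=1e-9):
    """Label a point as on the surface, inside, or outside the SE."""
    f = inside_outside(x, y, z, ax, ay, az, e1, e2)
    if abs(f - 1.0) <= tol:
        return "surface"
    return "inside" if f < 1.0 else "outside"

# The same point is outside a unit sphere but inside a squarish unit SE.
sphere_label = classify(0.9, 0.9, 0.9, 1, 1, 1, 1.0, 1.0)
boxy_label = classify(0.9, 0.9, 0.9, 1, 1, 1, 0.1, 0.1)
```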


Definitions in hand, let us now discuss some of the rationale
for using SE’s for object recognition. In short, they are flexible
enough to represent a wide class of objects, but are simple enough
to be recovered from 3-D data. Their functional form and inside-outside
function provide useful tools to aid in their recovery.

Traditional  constructive  solid  modeling  systems  use  boolean 
operations  to  combine  primitives,  such  as,  spheres,  cylinders,  and 
rectangular  solids.  By  extending  the  primitive  shapes  to  SEs, 
we  allow  the  user  to  easily  produce  a  continuum  of  forms,  from 
spheres,  to  cuboids  with  rounded  corners,  to  cubes,  to  diamond 
shapes.  Such  objects  are  difficult  to  model  with  traditional  con¬ 
structive  solid  geometry  (hereafter  CSG)  systems.  Making  them 
primitives  simplifies  the  job  of  the  designer  for  any  problem  that 
contains  objects  with  these  properties.1 

However,  adding  flexibility  above  that  of  a  traditional  CSG 
system  is  not  the  only  reason  to  choose  superquadrics.  For  added 
flexibility,  one  could  make  the  primitives  of  the  system  General¬ 
ized  Cylinders  or  Generalized  Cones,  see  [Brooks  and  Binford  80]. 
The  problem  with  GCs  is  that  recovering  them  is  a  difficult  pro¬ 
cess,  partially  because  each  GC  may  require  hundreds  of  pa¬ 
rameters  to  describe  it.  The  number  of  parameters  necessary 
to  define  a  GC  depends  on  the  complexity  of  the  cross  section 
function,  the  spine,  and  sweeping  function.  As  in  [Brooks  85], 
[Kao  and  Nevatia  86],  or  [Ponce,  Chelberg  and  Mann  87],  one  gen¬ 
erally  has  to  greatly  restrict  the  class  of  GCs  allowed  before  one 
can  reliably  recover  them. 


These equations define superellipsoids (hereafter referred to as
SEs) which are probably the simplest of a family of surfaces called
superquadrics. SE's are the only superquadrics which can be made
into a convex solid. One can also define superhyperboloids of one
or two sheets and supertoroids, see [Barr 81]. In addition to the
variety of shapes defined by the basic superquadrics, Barr also
discusses the application of angle-preserving transforms which allow
translation, rotation, bending and twisting. With the addition
of tapering and traditional boolean combination operations,
superquadrics become a powerful modeling tool, see Figure 2 for
some examples generated using SuperSketch [Pentland 86b].

While translation, rotation, bending and tapering are important
modifications to basic SE primitives, they will not be considered
in the remainder of the paper. Future work will examine the
effect of these deformations on error-of-fit measures and potential
implementations.


It is interesting to note that superellipsoids and supertoroids
can be defined as a subclass of GC; SEs are those straight spined
GCs with their cross section and sweeping functions defined by
superellipses.2 If the SEs are bent or tapered, these deformations
are applied to the spine and sweeping function respectively.

1A simple CAD system has been developed by A. Pentland, see
[Pentland 86b], that combines SEs (constrained to only those that are convex
solids) with CSG type operations plus bending and tapering. The ease with
which people became accustomed to modeling with this system may be tied
to their own internal representation of the objects. [Pentland 86a] presents
arguments to motivate the use of superquadrics as modeling primitives by
showing correspondences to human vocalization of object descriptions.

2The sweeping rule is highly nonlinear, and always goes to zero. The
authors do not know of any system which can recover a subset of GCs containing
this class.




Thus superquadrics provide a restriction which may allow efficient
recovery as well as a simpler set of “limb equations” (after
[Ponce, Chelberg and Mann 87]). In addition, they provide
mathematical properties (the explicit parametric definitions, inside-outside
function (implicit definition) and a normal surface duality
principle) which do not exist for most classes of GCs.

The inside-outside function provides a useful tool for recovery,
because it provides a simple way to determine which of the
data points are inside the surface. Furthermore, the value of
the function grows as points are moved further from the surface.
Thus it can be used in conjunction with a minimization technique
to recover the surface. However, the inside-outside function
is not a euclidean “error of fit” measure (see the following section),
and therefore direct minimization of it will not necessarily
provide satisfactory results. (However, in two independent
implementations this technique has produced satisfactory results, see
[Bajcsy and Solina 87b] and [Boult and Gross 87b], and section
4.)

Canonical superquadrics (except the supertoroids) have a desirable
duality property. The superquadric normal vectors lie on a
dual superquadric form; that is, if the normal vectors were translated
to the origin, they would generate another superquadric
of the same class such that the normal of the new surface (if
translated to the origin) would produce a copy of the original
superquadric (except for a translation). One can easily derive
the form of the normal surface as a spherical product as well
as its own inside-outside function, see [Barr 81]. Note that the
inside-outside function for the surface normals might be used to
find superquadrics fitting surface normal information as would be
available from shape-from-X methods.


3  Comparison of four “Error of fit” measures


The recovery of superellipsoids from depth data has recently been
the subject of research in various labs, e.g. see [Pentland 87],
[Bajcsy and Solina 87b], and [Boult and Gross 87b]. Each of the
approaches has something in common: at some point they define
some measure of the “error of fit” (hereafter EOF), and they use
a nonlinear minimization technique to recover the parameters of
the superellipsoid starting from an initial estimate.3 It is well
known, from work on fitting curves to points in the plane, that
different EOF functions can result in radically different curve
estimations, e.g. see [Pratt 87], [Sampson 82], [Bookstein 79] and
[Turner 74]. On close examination of Columbia's initial system
to recover SEs, it became apparent that the EOF function being
used was not even proportional to sensor errors (determined using
synthetic data). The purpose of this section is to examine some of
the measures of the error of fit used by these researchers as well as
some measures not yet used in minimization procedures. While
rotation and translation are not going to be considered herein,
they are easily added to any of the measures under consideration.

3There are many differences in the details of the algorithms. The work of
[Bajcsy and Solina 87b] and [Boult and Gross 87b] differ mostly in minor
details, and both are driven mainly by the minimization approach with only
minor effort to derive good initial estimates of the parameters. The approach in
[Pentland 87] is radically different, using a detailed nonadaptive initial search
of the parameter space (at ...,000 locations), but still maintains the
minimization of some error of fit measure after that initial search through the
parameter space.


Figure 3: Drawing showing the intuitive meaning of measures.
(This is an x-y cross section of an SE.)

In particular, this section considers 4 measures of error of fit.
This section defines each of these mathematically and also gives
a brief intuitive description in 2 dimensional physical/geometrical
terms.


3.1  EOF1: Mean-square-value of the inside-outside function

The  first  measure  to  be  considered  is,  in  some  sense,  the  simplest. 
It  is  directly  based  on  the  inside-outside  function  which  can  be 
used  to  implicitly  define  the  SE. 

$$EOF_1 = \frac{1}{n} \sum_{i=1 \ldots n} \left(1 - f(x_i, y_i, z_i; \epsilon_1, \epsilon_2, a_x, a_y, a_z)\right)^2 \qquad (3)$$

where $x_i, y_i, z_i$ are the given depth data points converted to a
single viewpoint, and where $f(x_i, y_i, z_i; \epsilon_1, \epsilon_2, a_x, a_y, a_z) \equiv f(x, y, z)$,
where $\epsilon_1, \epsilon_2, a_x, a_y, a_z$ are the parameters in equation 2. A rough
intuitive definition of this measure, see figure 3, is as a power of the
ratio of the lengths of two line segments. The exponent to which
the ratio is raised depends on the squareness-roundness in the xz
direction and is given by $2/\epsilon_1$. The denominator of the ratio is the
length of the line segment between the data point and the center
of the SE. The numerator is the length of that portion of the above
mentioned line segment connecting the center of the SE with that
point on the surface in the direction of the data point. Given this
intuitive definition, it is obvious that the measure can decrease
if the object size grows even if the actual distance to the surface
increases. Additionally, for small values of $\epsilon_1$ this measure can
be quite large (or small depending on ratio), and computationally
unstable. This will be borne out through the computational
evaluation in later sections.
3.2  EOF2: “Corrected” mean-square inside-outside measure

The second measure to be considered is a modified version of the
first, intended to remove some of its most undesirable properties: the
rapid growth of EOF1 for small values of $\epsilon_1$, and some of its biases.
The new measure is given by:

$$EOF_2 = \frac{1}{n} \sum_{i=1 \ldots n} \left(1 - f^{\epsilon_1}(x_i, y_i, z_i; \epsilon_1, \epsilon_2, a_x, a_y, a_z)\right)^2 \qquad (4)$$

where terms are as defined for EOF1.

The intuitive definition of this measure is as the same ratio
as for EOF1 except that the power to which the ratio is raised is
simply 2. Thus, this measure gives a type of “relative error” in a
radial direction; however, it greatly affects measurements in the z
direction (the drawing above ignores that direction). As before, it
is obvious that the measure can decrease if the object size grows
even if the actual distance to the surface increases. Note that to
date this particular measure has not been used in an SE recovery
system.
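
The difference between the first two measures can be seen numerically. The sketch below is illustrative only (the names `inside_outside`, `eof1` and `eof2` are hypothetical, and absolute values are an assumed convention): it places a single data point 10% beyond the surface of a unit SE along the z axis and compares a round shape with a squarish one.

```python
def inside_outside(p, ax, ay, az, e1, e2):
    """f of equation 2 evaluated at point p = (x, y, z)."""
    x, y, z = p
    xy = (abs(x / ax) ** (2.0 / e2) + abs(y / ay) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / az) ** (2.0 / e1)

def eof1(points, ax, ay, az, e1, e2):
    """Equation 3: mean square of (1 - f)."""
    return sum((1.0 - inside_outside(p, ax, ay, az, e1, e2)) ** 2
               for p in points) / len(points)

def eof2(points, ax, ay, az, e1, e2):
    """Equation 4: mean square of (1 - f**e1)."""
    return sum((1.0 - inside_outside(p, ax, ay, az, e1, e2) ** e1) ** 2
               for p in points) / len(points)

pts = [(0.0, 0.0, 1.1)]                      # 10% outside along z
round_1 = eof1(pts, 1, 1, 1, 1.0, 1.0)       # e1 = 1 (sphere)
square_1 = eof1(pts, 1, 1, 1, 0.1, 1.0)      # e1 = 0.1 (squarish in xz)
round_2 = eof2(pts, 1, 1, 1, 1.0, 1.0)
square_2 = eof2(pts, 1, 1, 1, 0.1, 1.0)
```

For the same radial displacement, `eof1` grows by orders of magnitude when `e1` is small because the ratio is raised to the power 2/e1, while `eof2` reports the same value for both shapes because its exponent is fixed at 2.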


3.3  EOF3: a “minimal volume” function

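Another EOF function, used by [Bajcsy and Solina 87b], is

$$EOF_3 = \frac{1}{n} \sum_{i=1 \ldots n} \left(\sqrt{a_x \cdot a_y \cdot a_z} \cdot \left(1 - f^{\epsilon_1}(x_i, y_i, z_i; \epsilon_1, \epsilon_2, a_x, a_y, a_z)\right)\right)^2 \qquad (5)$$

where terms are as defined for EOF1.4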

A variant of this error of fit measure was heuristically first
motivated in [Bajcsy and Solina 87b] by the idea that there are
many surfaces which might fit data from a single view, and the
system should recover the surface of minimal volume. The actual
equations 10 and 11 in [Bajcsy and Solina 87b], as well as similar
equations in [Bajcsy and Solina 87a], suggest a multiplicative factor
of $a_x \cdot a_y \cdot a_z$ rather than $\sqrt{a_x \cdot a_y \cdot a_z}$. However, this is not in
keeping with the heuristic definition, and the latter form appears
in [Solina 87]. While not reported here, experimental comparisons
have found that the factor of $a_x \cdot a_y \cdot a_z$ results in a measure
which has similar properties, but exaggerates the poor behavior
of EOF3.

The rough intuitive definition of EOF3 is simply the square
of the “relative error” in the radial direction multiplied by the
square-root of volume of the cube bounding the SE (which is
meant to be a crude approximation to the SE itself). The advantage
of this is that in a pure relative error measure, the absolute
error in the radial direction was divided by a term containing
the radial distance to the surface in some direction. Thus with
the points inside the object, the object could grow and reduce
EOF1 or EOF2 (with respect to one point) while the absolute
error would grow. By multiplying by the volume of the bounding
rectangular solid, this behavior is curtailed. However, this
behavior is replaced with the possibility of reducing the measure
by keeping the relative error in the radial direction constant, but
reducing the size of the SE (i.e., the measure is sensitive to scale).

4Actually, there are a few minor differences between this definition and
the one in [Bajcsy and Solina 87b]. The first difference is the multiplication by
$\frac{1}{n}$, which normalizes by the number of data points. The second difference
is that they use a homogeneous transformation to deal with rotation and
translation, thus the derivation of EOF3 in that paper is not given in terms
of the f above, but this formulation can be easily shown to be equivalent.
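
The scale sensitivity of EOF3 can be checked with a small sketch, given here under the assumption that equations 4 and 5 are as stated above. The names `f_pow_e1`, `eof2` and `eof3`, the sample points, and the scale factor `k` are all illustrative choices, not values from the experiments.

```python
import math

def f_pow_e1(p, ax, ay, az, e1, e2):
    """Inside-outside function of equation 2 raised to the power e1."""
    x, y, z = p
    xy = (abs(x / ax) ** (2.0 / e2) + abs(y / ay) ** (2.0 / e2)) ** (e2 / e1)
    return (xy + abs(z / az) ** (2.0 / e1)) ** e1

def eof2(points, ax, ay, az, e1, e2):
    """Equation 4."""
    return sum((1.0 - f_pow_e1(p, ax, ay, az, e1, e2)) ** 2
               for p in points) / len(points)

def eof3(points, ax, ay, az, e1, e2):
    """Equation 5: equation 4 with a sqrt(ax*ay*az) factor inside the square."""
    vol = math.sqrt(ax * ay * az)
    return sum((vol * (1.0 - f_pow_e1(p, ax, ay, az, e1, e2))) ** 2
               for p in points) / len(points)

points = [(1.05, 0.2, 0.1), (0.3, 0.9, 0.4), (0.1, 0.2, 1.02)]
k = 100.0                                   # uniform change of units
scaled = [(k * x, k * y, k * z) for x, y, z in points]

ratio_eof2 = eof2(scaled, k, k, k, 1.0, 1.0) / eof2(points, 1, 1, 1, 1.0, 1.0)
ratio_eof3 = eof3(scaled, k, k, k, 1.0, 1.0) / eof3(points, 1, 1, 1, 1.0, 1.0)
```

Under a uniform rescaling of both the data and the size parameters, `ratio_eof2` stays at one while `ratio_eof3` equals the cube of the scale factor.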


3.4  EOF4: A measure based on true euclidean distance

A  final  test  measure  is  defined  as  the  mean  distance  between  each 
data  point  and  the  corresponding  point  on  the  surface  of  the  su¬ 
perquadric  on  the  line  connecting  the  data  point  and  the  center 
of  the  SE.  In  equation  form  this  becomes: 

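$$EOF_4 = \frac{1}{n} \sum_{i=1 \ldots n} \left\| \mathbf{x}_i - \mathbf{q}_i \right\| = \frac{1}{n} \sum_{i=1 \ldots n} \sqrt{(x_i - q_{i_x})^2 + (y_i - q_{i_y})^2 + (z_i - q_{i_z})^2} \qquad (6)$$

where

$$\mathbf{q}_i = s(\eta, \omega), \quad \omega = \tan^{-1}\left(\left(\frac{y_i\, a_x}{x_i\, a_y}\right)^{1/\epsilon_2}\right), \quad \eta = \tan^{-1}\left(\left(\frac{z_i\, a_y}{y_i\, a_z} \cdot \sin^{\epsilon_2}(\omega)\right)^{1/\epsilon_1}\right),$$

and $s(\cdot,\cdot)$ is defined as in equation 1, and the relative signs of
$x_i, y_i, z_i$, the $i$th data point, are used to determine the correct
spherical quadrant of the arctan.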

This measure overestimates the minimal distance of a point
from the SE, especially for squarish ones. However, it is a true
euclidean distance measure, and thus is not affected by overall
scaling of the problem. There are probably better euclidean based
measures; however, the only better measures currently known to
the authors do not have a functional form defining them, and thus
are difficult to minimize.
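
A minimal sketch of the radial distance measure follows. Note the swapped-in technique: instead of the arctangent expressions for the angles, it uses the fact that the inside-outside function of equation 2 scales as the power 2/e1 of the radial distance, so the point where the ray from the center through a data point meets the surface is the data point multiplied by f raised to the power -e1/2. This identifies the same point on the ray and avoids quadrant bookkeeping. The function names are hypothetical, and the data point is assumed not to coincide with the SE center.

```python
import math

def inside_outside(p, ax, ay, az, e1, e2):
    """f of equation 2 evaluated at point p = (x, y, z)."""
    x, y, z = p
    xy = (abs(x / ax) ** (2.0 / e2) + abs(y / ay) ** (2.0 / e2)) ** (e2 / e1)
    return xy + abs(z / az) ** (2.0 / e1)

def radial_surface_point(p, ax, ay, az, e1, e2):
    """Point where the ray from the SE center through p meets the surface."""
    beta = inside_outside(p, ax, ay, az, e1, e2) ** (-e1 / 2.0)
    return tuple(beta * c for c in p)

def eof4(points, ax, ay, az, e1, e2):
    """Equation 6: mean radial euclidean distance to the surface."""
    total = 0.0
    for p in points:
        q = radial_surface_point(p, ax, ay, az, e1, e2)
        total += math.sqrt(sum((pc - qc) ** 2 for pc, qc in zip(p, q)))
    return total / len(points)

# A point at distance 2 from the center of a unit sphere is 1 from its surface.
sphere_distance = eof4([(0.0, 0.0, 2.0)], 1, 1, 1, 1.0, 1.0)
q = radial_surface_point((2.0, 1.5, 0.5), 1.0, 2.0, 3.0, 0.5, 0.7)
f_at_q = inside_outside(q, 1.0, 2.0, 3.0, 0.5, 0.7)
```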


3.5  Evaluation of the Different Error of Fit Measures


This section begins the analysis of the various error of fit measures.
The main vehicle for presentation will be graphical, with running
commentary giving one interpretation of the results. The reader
is advised to draw her/his own conclusions.

The evaluation is in terms of two properties of the measures:
parameter bias, and shape of cross sections of the error surface.
The parameter bias makes it difficult to compare two potential
solutions and also generally affects the minimization by shifting
the location of the minima. The relation of the measure (in terms
of the scale of values) to the RMS error of data is a by-product of
the experiment to determine bias. The shape of cross sections of
the error surface is important because it affects both the possibility
of minimization (in particular, is it relatively concave?), and
the rate of convergence of iterative minimization techniques (how
large is the gradient?).


3.5.1  Determination  of  Measure  Bias 

To compare the parameter bias, each measure will be computed
using the “correct” values of the 5 parameters (i.e. the parameters
used to generate a surface). Of course, because of the noise in the
data points, these parameters will generally not be those parameters
for which a given measure is minimized. (More about that
later.) The different cases considered in this section should give
the reader a good intuitive feeling for how the different measures
are biased (with respect to the true RMS error) as the parameters
defining an SE are varied. To cut down on the amount of analysis,
certain symmetries have been anticipated; in particular $a_y$ is
assumed to be identical to $a_x$. The discussions in this section are
a review of the computational experiment, and only represent a
small sample of the experimental data.

Each of the “bias” graphs shows a sequence of computational
experiments where a single parameter is varied over some range.
The values are scaled, and the pattern of the scaled values often
provides a good feel for the bias of a measure. The scale values
themselves show the relative difference in the size of the error
measure (using the “correct” answer) with respect to the RMS
error added to the data.

The  data  in  each  graph  was  computed  as  follows: 

Step 1  An initial set of parameters for the SE, error range, and
sensor direction (x, y, or z) are chosen. (Only the x direction
is presented here.) Also, a parameter is chosen to vary in the
iterations. Each iteration generates one data point on the
graph for each of the different measures, computed in steps
2 through 5.

Step 2  For a given set of SE parameters (i.e. each iteration),
500 points are generated on the surface of the SE assuming
a uniform pseudo-random distribution of angles on the
viewing hemisphere for the chosen sensor direction.

Step 3  For each data point, a pseudo-random noise value is
computed assuming a uniform distribution in a predetermined
range. The noise is added to the data point in the direction
of the sensor. The value is directly added to the variable
used to compute the RMS measure.

Step 4  Using the exact value of the set of parameters for this
iteration, and the noisy data computed in steps 2 and 3,
the value of each of the measures EOF1, EOF2, EOF3 and
EOF4 is computed and recorded.

Step 5  For each measure (including RMS) the recorded values
are examined and the maximum of a measure is used to
re-scale its value into [0,1] and the value is entered on the
graph.
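
The five steps above can be sketched as follows. This is a rough illustration under several assumptions that the text does not fix: angles are drawn uniformly in the two angular parameters over the hemisphere facing the positive x axis, the varied parameter is the longitudinal shape parameter, the remaining parameters are set to one, and the radial distance for EOF4 is obtained from the radial scaling of the inside-outside function. All function and variable names, the noise range, and the seed are hypothetical.

```python
import math
import random

def spow(b, e):
    """Signed power, keeping fractional exponents real (assumed convention)."""
    return math.copysign(abs(b) ** e, b)

def se_point(eta, omega, ax, ay, az, e1, e2):
    """Surface point of equation 1."""
    ce = spow(math.cos(eta), e1)
    return (ax * ce * spow(math.cos(omega), e2),
            ay * ce * spow(math.sin(omega), e2),
            az * spow(math.sin(eta), e1))

def f_io(p, ax, ay, az, e1, e2):
    """Inside-outside function of equation 2."""
    x, y, z = p
    xy = (abs(x / ax) ** (2 / e2) + abs(y / ay) ** (2 / e2)) ** (e2 / e1)
    return xy + abs(z / az) ** (2 / e1)

def measures(points, ax, ay, az, e1, e2):
    """Return [EOF1, EOF2, EOF3, EOF4] for the given data and parameters."""
    m1 = m2 = m3 = m4 = 0.0
    vol = math.sqrt(ax * ay * az)
    for p in points:
        f = f_io(p, ax, ay, az, e1, e2)
        fe = f ** e1
        m1 += (1 - f) ** 2
        m2 += (1 - fe) ** 2
        m3 += (vol * (1 - fe)) ** 2
        beta = f ** (-e1 / 2)      # radial scale that puts p on the surface
        m4 += abs(1 - beta) * math.sqrt(sum(c * c for c in p))
    return [m / len(points) for m in (m1, m2, m3, m4)]

def bias_run(values, noise=0.05, n_points=500, seed=0):
    rng = random.Random(seed)
    ax = ay = az = 1.0
    e1 = 1.0
    rows = []
    for e2 in values:                          # Step 1: one iteration per value
        pts, sum_sq = [], 0.0
        for _ in range(n_points):              # Step 2: points facing +x
            eta = rng.uniform(-math.pi / 2, math.pi / 2)
            omega = rng.uniform(-math.pi / 2, math.pi / 2)
            x, y, z = se_point(eta, omega, ax, ay, az, e1, e2)
            d = rng.uniform(-noise, noise)     # Step 3: noise along the sensor axis
            sum_sq += d * d
            pts.append((x + d, y, z))
        rms = math.sqrt(sum_sq / n_points)
        rows.append([rms] + measures(pts, ax, ay, az, e1, e2))  # Step 4
    # Step 5: rescale each column (RMS, EOF1..EOF4) by its maximum
    maxima = [max(r[j] for r in rows) for j in range(5)]
    scaled = [[r[j] / maxima[j] for j in range(5)] for r in rows]
    return scaled, maxima

scaled, maxima = bias_run([0.5, 1.0, 1.5])
```

Each row of `scaled` corresponds to one plotted iteration, and `maxima` plays the role of the scaling factors shown in a graph legend.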

Before  beginning  the  description  of  the  initial  results,  it  is 
necessary  to  make  a  few  comments  about  the  graphical  presenta¬ 
tion  of  the  results.  First  note  that  each  graph  contains  5  different 
marks  which  are  used  to  show  the  scaled  value  of  the  measures. 
Also, each graph has a legend with scaling factors. Thus in figure 4,
the computed value of EOF1 for $\epsilon_{xy} = .5$ is approximately
.64 x .176, and when $\epsilon_{xy} = 1.8$ is approximately .91 x .176.
Looking at the tick-marks allows one to obtain a feel for the measure’s
bias  as  the  parameter  varies,  and  the  scaling  provides  a  means  of 
relating  the  measure  to  the  actual  RMS  error.  For  space  consid¬ 
erations,  only  a  few  examples  are  presented  here.  However,  the 
running  commentary  in  this  section  is  based  on  the  analysis  of 
over  500  of  the  above  computational  experiments. 


The interest in bias factors stems from their importance in
comparing the relative quality of two potential solutions. For
example, if the bias in a parameter P is toward larger values, then
if two candidate sets of parameters differ only in parameter P,
it can occur that the set with the larger P value reports
a larger measure but has a smaller actual RMS error. Thus to
properly compare two potential solutions, one needs to be able to
estimate the bias in each of the parameters in which they differ.

The  reader  should  recall  that  these  graphs  are  not  showing 
the  shape  of  the  error  surface  near  the  correct  solution,  and  the 
non-monotonic  behavior  does  not  mean  that  minimization  will  be 
difficult. 

The first graphical display for the bias experiment is in figure 4,
and involves the bias of the measures with respect to the
parameter $\epsilon_{xy}$, the squareness-roundness in the xy direction. The
first thing that should strike you about the graph is that all of
the measures clump together, and none of them show a particularly
smooth bias. There is a general trend toward having a larger
measure for larger $\epsilon_{xy}$, but locally it is far from monotonic. Thus, the
result of the minimization process on all the measures will generally
report some value which overestimates the error (even if
it could be scaled) and the overestimate is a generally increasing
(but nonmonotonic) function of $\epsilon_{xy}$.

One  interpretation  of  the  clumping  of  the  measures  is  that 
the  observed  bias  is  caused  by  measurement  in  the  radial  direction 
when  the  sensor  error  is  in  an  axial  direction.  This  interpretation 
suggests  that  no  radial  measure  will  be  bias-free  and  provides  an 
impetus  to  attempt  to  derive  measures  in  the  sensor  direction. 

The  scales  of  the  measured  values  were  generally  close  to  the 
RMS  error,  with  EOF4  very  close  (by  accident),  and  EOF3  EOF4 
off by a factor of 2. Although not evident from the examples
presented here, EOF3 is greatly affected by scaling. If all sizes in
the example were multiplied by a factor of 100 (say meters into
cm) then EOF3 would increase by a factor of 1000000. On the
other hand, EOF1 would decrease (by a much smaller amount) as
the scale was increased.

The second example in the bias determination experiment
is shown in figure 5. The most striking feature of this graph is the
radically different behavior of the measures. Note that EOF4
follows the RMS error rather well, showing that it is not biased
with respect to $\epsilon_z$. EOF2 and EOF3 (which differ only in scale)
show the same nonmonotonic but generally increasing bias as
they did for $\epsilon_{xy}$. (Close examination shows qualitatively similar
small scale perturbations in EOF2, EOF3 and EOF4. Again,
these are probably biases inherent in any axial measure.) While
the other measures show a bias for larger values, EOF1 has a
significant, and apparently monotonic, bias for smaller values.

A second fact to consider in this example is the relative scales
of the measures. A glance at the scaling factors in the legend
suggests that only EOF4 is closely linked to the actual RMS error.
The other measures are all off by at least an order of magnitude.

The  third  example  in  the  bias  determination  experiment  is 
shown  in  figure  6.  The  most  striking  feature  of  this  graph  is  the 
clear  separation  of  the  measures.  EOFi  tracks  the  RMS  error 
rather well, showing that it is not biased over this range of a_x.
EOF4 is also rather unbiased, with the exception of very small
values of a_x, which have a negative bias. Both of the aforementioned
measures show non-monotonicity in what bias they do have, but
the deviations (when appropriately scaled) from the RMS error
are  not  very  significant.  EOFi  on  the  other  hand,  shows  a  strong, 
almost monotonic, bias toward smaller values of a_x, as does EOF3.
The bias of EOF3 for smaller values might be attributed to the
“volume” factor which is the only difference between it and EOF2.
While  not  shown  here,  the  above  patterns  are  still  present  when 
the  sensor  direction  is  changed  to  the  z  axis,  although  the  rates 
of  change  of  the  bias  (i.e.  slopes  of  curves)  are  smaller. 

The final example from the bias experiment is shown in figure 7.
Again there is a rather noticeable difference between the
measures. This time it is EOF1 which displays virtually no
particular bias, although there are some obvious local
non-monotonicities. The other three measures all show a strong bias
for larger a_z values, with the bias in EOF3 slightly less than in
the other measures. While one might have predicted a bias in EOF3
for small values of a_z, the greater bias of the underlying EOF2
factor dominates the “volume” term (but the latter probably
provides the mitigating factor in the bias as compared with EOF2
and EOF4). To the eye, the bias trends in all but EOF1 appear
linear, and anticipation/compensation might be possible. Again, the
experiments with the sensor direction along the z axis have similar
overall characteristics, though the slopes of the bias “curves”
were smaller.

As  mentioned,  there  were  also  numerous  test  cases  included 
in  the  experiment  which  were  not  reported  here.  To  summarize 
the  results,  little  characteristic  difference  was  found  to  be  caused 
by  changes  in  the  sensor  direction,  although  many  minor  varia¬ 
tions  did  occur.  As  mentioned  above,  many  of  the  measures  also 
displayed  a  bias  in  terms  of  overall  scale  of  the  problem. 

A final comment about the bias experiments is in order. The
graphical presentation makes it very easy to overlook the absolute
scaling parameters in the legend. However, if one were interested in
converting from one measure into an approximate RMS measure,
even knowing the bias characteristic would not be useful unless
one could also compute the scaling factors. As shown in the next
section, many of the measures can have significant scale changes
in the neighborhood of the correct solution; thus estimation of
the scale factors from this experiment is unlikely.


3.5.2  Analysis  of  cross  sections  of  error-of-fit  surface 

This section discusses experiments which attempt to determine
the structure of the error-of-fit surface by examining
one-dimensional slices through that multi-dimensional surface. The
experiments considered both the structure in a small neighborhood of
the correct parameter value and the larger scale structure (say
variations of 10-100%). The local structure provides insight into
how the measures err (i.e. it shows the location of each measure's
minimum). The large scale structure provides insight into the
likelihood of successful minimization when reasonable initial
estimates are used as starting values. Because global properties are
of more importance, only the experiments using a wide range of values
will be reported here, even though some of the conclusions are more
readily made by examination of the experiments in small
neighborhoods.

As in the reporting of the bias experiments, the presentation
will be graphical with commentary. Each graph was generated as
follows:

Step  1  Parameters  defining  SE  and  sensor  direction  are  chosen. 
Using  these  parameters,  500  points  are  generated  on  the 
surface  of  the  SE  assuming  a  uniform  pseudo-random  dis¬ 
tribution  of  angles  on  the  viewing  hemisphere  for  the  cho¬ 
sen  sensor  direction.  For  each  data  point,  a  pseudo-random 
noise  value  is  computed  assuming  a  uniform  distribution  in 
a  predetermined  range.  The  noise  is  added  to  the  data  point 
in  the  direction  of  the  sensor.  The  value  is  directly  added 
to  the  variable  used  to  compute  the  RMS  measure. 

Step  2  A  parameter  is  chosen  to  vary,  as  well  as  the  range  of  vari¬ 
ation.  For  each  iteration  point  (10-20  spanning  the  chosen 
range)  the  experimental  values  for  each  measure  are  gen¬ 
erated  using  the  current  value  of  the  variable  parameter, 
and  the  correct  values  of  the  remaining  parameters,  and  are 
recorded. 

Step  3  For  each  measure  (including  RMS)  the  recorded  values 
are  examined  and  the  maximum  of  a  measure  is  used  to  re¬ 
scale  its  value  into  the  interval  [0,1]  and  the  value  is  entered 
on  the  graph. 

Note that the above procedure results in the correct location of
the minimum occurring in the center of each graph.
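The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the standard superellipsoid parameterization and inside-outside function stand in for the SE model, the sensor is assumed to look along the z axis, and `error_of_fit` (the sum of squared deviations of the inside-outside function from 1) is a hypothetical stand-in rather than any of the measures EOF1 to EOF4. All function names are invented for the example.

```python
import math
import random

def signed_pow(v, e):
    """|v|**e with the sign of v, as used in superellipsoid parameterizations."""
    return math.copysign(abs(v) ** e, v)

def inside_outside(p, e1, e2, ax, ay, az):
    """Assumed standard inside-outside function; it equals 1 on the surface."""
    x, y, z = (abs(c) for c in p)
    xy = (x / ax) ** (2.0 / e2) + (y / ay) ** (2.0 / e2)
    return xy ** (e2 / e1) + (z / az) ** (2.0 / e1)

def sample_surface(n, e1, e2, ax, ay, az, noise, rng):
    """Step 1: n surface points on the hemisphere facing a sensor on +z,
    with uniform noise added along the sensor direction."""
    pts, sq_noise = [], 0.0
    for _ in range(n):
        eta = rng.uniform(0.0, math.pi / 2)
        omega = rng.uniform(-math.pi, math.pi)
        x = ax * signed_pow(math.cos(eta), e1) * signed_pow(math.cos(omega), e2)
        y = ay * signed_pow(math.cos(eta), e1) * signed_pow(math.sin(omega), e2)
        z = az * signed_pow(math.sin(eta), e1)
        d = rng.uniform(-noise, noise)
        pts.append((x, y, z + d))
        sq_noise += d * d          # the noise feeds the RMS value directly
    return pts, math.sqrt(sq_noise / n)

def error_of_fit(pts, e1, e2, ax, ay, az):
    """Hypothetical measure: sum of squared deviations of f from 1."""
    return sum((inside_outside(p, e1, e2, ax, ay, az) - 1.0) ** 2 for p in pts)

def cross_section(pts, true_params, index, lo, hi, steps=15):
    """Steps 2 and 3: vary one parameter, keep the others correct,
    then rescale the recorded values by their maximum."""
    values = []
    for k in range(steps):
        params = list(true_params)
        params[index] = lo + (hi - lo) * k / (steps - 1)
        values.append(error_of_fit(pts, *params))
    peak = max(values)
    return [v / peak for v in values]

rng = random.Random(0)
true_params = (1.0, 1.0, 1.0, 2.0, 3.0)      # (e1, e2, ax, ay, az), illustrative
pts, rms = sample_surface(500, *true_params, noise=0.05, rng=rng)
curve = cross_section(pts, true_params, index=4, lo=1.5, hi=4.5)
```

Because the range 1.5 to 4.5 is centered on the true value of the varied parameter, the correct value falls on the middle sample of `curve`, mirroring the layout of the graphs.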

Probably the most surprising result is that when rotations
and translations are ignored (as in the results reported here), all
of the measures generated relatively concave cross sections. Thus,
minimization of any of them (given correct rotations and
translations) should be straightforward. The main difference between
the measures was the degree of concavity, which would affect the
efficiency of the minimization.

The first cross section presented, figure 8, shows the cross
section as ε_z is varied. The clear separation of EOF4 and its higher
curvature suggest that it is the preferred measure in this example.
Note that since a_x, a_y and a_z are fixed, the structure of EOF2
and EOF3 is identical; they differ only in a scaling factor.
Finally, EOF1 does show some concavity, but the dramatic scale
differences make the minimum very shallow. While not shown, the
separation of the measures is even greater when the errors are
along the z axis.

It is worth pointing out that EOF2, EOF3 and EOF4 all
have their minima to the left of (i.e. at a smaller parameter value
than) the underlying surface, while EOF1 overestimates the parameter.
(The detailed analysis of the local region, which is not presented,
shows that the minimum of EOF4 is the closest to the parameter
value defining the actual underlying surface.) The reader might
want to compare this with the results of the bias experiment for
the same parameter.

The second cross section, presented in figure 9, shows that all of
the measures are very asymmetric about their minima. In addition,
although it cannot be seen from this graph, all of the measures have
their minima above the “correct” value. The dramatic changes in
scale for EOF1 make it appear flat on this graph, but it does have
a shape similar to the other measures, just more dramatic.

It is also obvious (though barely visible) that EOF1 has the
sharpest minimum, although minimization of any of the measures
with a starting value above the true solution looks like a good
idea. Again, EOF4 has its minimum closest to the correct value.
The asymmetric shape of this curve is even more pronounced when
the errors in the data are along the z axis.

A  comment  also  is  in  order  regarding  the  vast  differences  in 
scales  between  the  measures.  As  one  can  see  from  the  legend  in 
figure  9,  the  relative  scales  differ  by  orders  of  magnitude.  From 
a practical point of view, those measures which routinely have
a wide numerical range (EOF1, EOF2 and, depending on scaling,
possibly EOF3) present more numerical problems, and show a
greater sensitivity to roundoff errors.

A third cross section, see figure 10, shows a case where the
relative shape of all of the measures is almost identical, with EOF4
just slightly outperforming the other measures in the concavity
analysis. Also, note that all of the measures again overestimate
the correct underlying parameters (by about 1%-2%). When the
error was in the direction of the z axis, all measures were very
sharply concave, and the minima were very close to the correct
value.

The final cross sections, see figures 11 and 12, show the structure
of the error surfaces for some of the parameters used in the
bias  experiments.  Note  that  in  these  cases,  EOF4  is  a  clear  win- 


4  The  Current  System 

This section presents some of the details of our current system for
the recovery of superquadrics from 3-D information. The basis of
our system is the minimization of EOF1. Note that one of the
advantages of minimizing a measure based on the inside-outside
function is that it requires little extra effort to incorporate
multiple views, assuming one knows the sensor position for each view
(to convert points to a common coordinate system). However,
such an approach will not take error distributions into account,
and the errors in conversion may be exaggerated by the fitting
process.

Given 3-D information about an SE in canonical position, one
can use a nonlinear minimization technique to recover the 5
parameters needed to define it. Our system uses a Gauss-Newton
iterative nonlinear least squares minimization technique (for
example, see [Hageman and Young 70]). If the SE is not in canonical
position, the system must also recover estimates of the translation
and rotation necessary to put the information on the surface
of a canonical SE. There are obviously many approaches to deal
with the translation and rotation. The two most obvious are the
use of a pair of transforms (one for translation and one for
rotation) and the use of a homogeneous transform that combines both
translation and rotation. Our system uses the first approach.

For the remainder of this section let C_θ = cos θ and S_θ =
sin θ. Thus, given a canonical SE surface defined as s = s(η, ω)
with an inside-outside function f = f(x, y, z), the translated and
rotated SE solid ŝ is given by

$$ \hat{s} = R_\theta R_\phi R_\psi \, s + \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}, \qquad (8) $$

where θ, φ, and ψ are the Euler angles expressing pitch, yaw and
roll respectively, R_θ, R_φ, R_ψ are the associated rotation
matrices, and t_x, t_y, t_z are the translations in the x, y, and z
directions respectively. The definition of the rotation matrices and
Euler angles can be found in many standard texts on computer
graphics, e.g., [Rogers and Adams 76], as well as in the computer
vision literature, e.g., see [Tsai 86] or [Bolle and Murthy 86].
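The pair-of-transforms approach of equation (8) can be sketched as below. The axis assignment (pitch about x, yaw about y, roll about z) is an assumption made for the example; the cited texts define the convention actually used. The names `place` and `to_canonical` are hypothetical.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return ((1.0, 0.0, 0.0), (0.0, c, -s), (0.0, s, c))

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return ((c, 0.0, s), (0.0, 1.0, 0.0), (-s, 0.0, c))

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))

def mat_vec(m, v):
    return tuple(sum(m[r][k] * v[k] for k in range(3)) for r in range(3))

def transpose(m):
    return tuple(tuple(m[r][c] for r in range(3)) for c in range(3))

def place(s, theta, phi, psi, t):
    """Equation (8): rotate a canonical point s, then translate it by t."""
    v = mat_vec(rot_z(psi), s)       # assumed: roll about z, applied first
    v = mat_vec(rot_y(phi), v)       # assumed: yaw about y
    v = mat_vec(rot_x(theta), v)     # assumed: pitch about x, applied last
    return tuple(v[k] + t[k] for k in range(3))

def to_canonical(p, theta, phi, psi, t):
    """Inverse of place: undo the translation, then the rotations in reverse order."""
    v = tuple(p[k] - t[k] for k in range(3))
    v = mat_vec(transpose(rot_x(theta)), v)
    v = mat_vec(transpose(rot_y(phi)), v)
    v = mat_vec(transpose(rot_z(psi)), v)
    return v

s = (1.0, 2.0, 3.0)
pose = (0.1, 0.1, 0.1, (1.5, 2.5, 3.5))   # illustrative angles and translation
moved = place(s, *pose)
back = to_canonical(moved, *pose)
quarter = place((1.0, 0.0, 0.0), 0.0, 0.0, math.pi / 2, (0.0, 0.0, 0.0))
```

Keeping rotation and translation as separate steps makes the inverse mapping a matter of reversing the order, which is what a fitting procedure needs in order to evaluate the canonical inside-outside function at a measured point.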

Given the data for a general SE the system minimizes:

$$ EOF_1(\epsilon_1, \epsilon_2, a_x, a_y, a_z, t_x, t_y, t_z, \theta, \phi, \psi) = \sum_{i=1}^{n} \left[ f(x_i, y_i, z_i, \epsilon_1, \epsilon_2, a_x, a_y, a_z, t_x, t_y, t_z, \theta, \phi, \psi) \right]^2 \qquad (9) $$

where x_i, y_i, z_i are the given data points from a single viewpoint,
and where f(x_i, y_i, z_i, ε_1, ε_2, a_x, a_y, a_z, t_x, t_y, t_z, θ, φ, ψ) = f(x̂, ŷ, ẑ)
and (x̂, ŷ, ẑ) is the data point (x_i, y_i, z_i) mapped back into the
canonical coordinate system by inverting equation (8).

To employ the Gauss-Newton iteration, the system must compute
the Jacobian of the transformation, and thus also needs the
partial derivatives of f(x, y, z) with respect to the 11 parameters
(5 shape, 3 rotation and 3 translation), which were obtained
symbolically.
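To make the role of the Jacobian concrete, here is a much reduced Gauss-Newton sketch. It is not the system described here: it assumes a plain ellipsoid (both exponents fixed at 1) in canonical position, recovers only the three axis lengths, uses noise-free synthetic data, and takes the residual to be f - 1 with hand-derived partial derivatives. The helper names are invented for the example.

```python
import math
import random

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x

def gauss_newton(points, start, iterations=15):
    """Minimize the sum over points of (f - 1)^2, where
    f = (x/ax)^2 + (y/ay)^2 + (z/az)^2, over the axis lengths."""
    a = list(start)
    for _ in range(iterations):
        jtj = [[0.0] * 3 for _ in range(3)]
        jtr = [0.0] * 3
        for p in points:
            r = sum((p[k] / a[k]) ** 2 for k in range(3)) - 1.0
            # Analytic partial derivatives of the residual w.r.t. ax, ay, az.
            j = [-2.0 * p[k] ** 2 / a[k] ** 3 for k in range(3)]
            for u in range(3):
                jtr[u] += j[u] * r
                for v in range(3):
                    jtj[u][v] += j[u] * j[v]
        step = solve3(jtj, [-g for g in jtr])   # normal equations J^T J d = -J^T r
        a = [a[k] + step[k] for k in range(3)]
    return a

rng = random.Random(1)
true_a = (1.0, 2.0, 3.0)
points = []
for _ in range(200):
    eta = rng.uniform(-math.pi / 2, math.pi / 2)
    omega = rng.uniform(-math.pi, math.pi)
    points.append((true_a[0] * math.cos(eta) * math.cos(omega),
                   true_a[1] * math.cos(eta) * math.sin(omega),
                   true_a[2] * math.sin(eta)))

estimate = gauss_newton(points, (1.0, 1.0, 1.0))   # start from the canonical value 1
```

The full problem differs mainly in the size of the Jacobian (one column per parameter, 11 in all) and in the need to map each data point into canonical coordinates before evaluating f.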

The initial implementation of the system recovered only the
5 shape parameters. Using synthetic data with up to 10% uniformly
distributed noise, the system could start the minimization
procedure from a canonical position (all 5 parameters = 1) and in
most cases was still able to recover the underlying surface within
the error of the data. Moreover, the convergence was generally
quick, requiring < 15 iterations. These results are reported
because, if through some other means one were able to compute the
local coordinate system of the SE (or put very good bounds on
it), the stability of the algorithm would be greatly increased.

When the system was extended to handle 11 parameters (i.e.
5 shape and 6 position/orientation), things became more
complicated. For most of the examples presented, even when
presented with very poor estimates for starting values, the system
was quickly able to find an SE that had low “error of fit”.
Unfortunately, the solutions proposed by the system often seemed
to be nonintuitive. However, when examined closely, many of
these solutions proved themselves to be reasonable (though not
the most reasonable) interpretations of the data. The differences
could generally be traced to the nonintuitive behavior of the
error-of-fit measure. For example, when presented with the range
data (from the Utah range database, [UTAH 85]) for a coke can minus
the concave portion of the bottom end, the system initially
proposed a solution with EOF1 = .0078. However, the parameters
described an object that is slightly beveled along its long axis,
and rather round in the other direction. The length of the object
was far larger than expected. The exaggerated length was caused
by two factors:

1.  the  model  is  assumed  to  be  intersected  with  a  negative  ellip¬ 
soid  at  the  top,  and  cut  by  the  ground  plane  at  the  bottom, 
and 

2.  the  “error  of  fit”  measure  used  by  the  system  is  an  underes¬ 
timate,  and  biased  toward  larger  objects,  see  Section  3  for 
further  discussion  of  this  problem. 

When examined closely, the proposed object (assuming the
volume terminates when it intersects another object in the scene)
does seem to be a reasonable fit to the data, but still a cylinder
seems intuitively to be a better fit. If forced to look for a
cylindrical object, the system finds an SE with EOF1 of .17. Oddly
enough, the system converged to a can more readily if data from
the negative ellipsoid was not removed. As reported in the next
section, the use of a different error-of-fit measure resulted in better
recovery.

Currently, the system obtains initial estimates of the
translation parameters from the centroids of the original data and
derives bounds from sensor information and overall data (maximum
variation in any data). However, there are many problems with these
estimates, especially if the number of data points is small or if
the system is only given a partial view of an object. These
estimates are obviously much better if multiple views of the object
are available. The estimates also assume that the data is
segmented, an assumption with which the authors feel particularly
uncomfortable. The system derives estimates of rotation angles
and length scales from moments of inertia and bounds on the
length parameters from sensor information and overall data.
Because the moments of inertia require second order moments, the
estimates are plagued with more difficulties than the estimates of
the translation parameters. The estimates of both the rotation
and length scales are very poor if an object (after segmentation)
is the result of boolean combinations.
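A minimal sketch of such moment-based initial estimates follows. It is illustrative only: the function names are invented, the data are synthetic, and power iteration on the second-moment matrix is a technique swapped in for the example to find the dominant axis; how the system converts moments into rotation angles is not specified here.

```python
import math

def centroid(points):
    """First-order moments: an initial estimate of the translation."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def second_moments(points, c):
    """Central second-order moment matrix of the data about the centroid c."""
    n = len(points)
    return [[sum((p[u] - c[u]) * (p[v] - c[v]) for p in points) / n
             for v in range(3)] for u in range(3)]

def principal_axis(m, iterations=100):
    """Dominant eigenvector of m by power iteration, plus the spread along it.
    Assumes the start vector (1, 1, 1) is not orthogonal to that eigenvector."""
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[r] * m[r][k] * v[k] for r in range(3) for k in range(3))
    return v, math.sqrt(variance)

# Synthetic data elongated along the direction (1, 1, 0) with small wiggles.
points = []
for i in range(101):
    t = (i - 50) / 10.0
    points.append((t, t + 0.1 * ((i % 3) - 1), 0.05 * (i % 2)))

c = centroid(points)
axis, scale = principal_axis(second_moments(points, c))
alignment = abs(axis[0] + axis[1]) / math.sqrt(2.0)   # |cos| of angle to (1, 1, 0)
```

With only a partial view of an object, the centroid and second moments describe the visible patch rather than the whole solid, which is one way to see why these estimates degrade in the cases mentioned above.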

The system also estimates bounds on the maximum values for
each parameter, deriving these estimates from knowledge of the
“sensor” and the data values. If, during the minimization process,
any parameter attempts to stray beyond its allowed boundary, the
system stochastically “pushes” it back toward its initial value.

4.1  A  Few  Examples  from  Our  Current  System 

The first example, in figure 13, is a synthetic SE with noisy
synthetic data from multiple views. The actual parameters of the
superquadric are (1.59, .39, 1, 2, 3, 1.5, 2.5, 3.5, .1, .1, .1).⁵ The
noise in the underlying object was uniformly distributed over the
interval [-.15, +.15] and then added to the z (depth) value of a
point. The 1000 data points were randomly distributed on the
surface before noise was added. The system recovered an SE with
parameters (1.8, .3, 1.2, 3.58, 1.19, 2.5, 3.17, .079, .09, .101). The
error (as measured by EOF1) of the reconstruction was .079. The
system required 7 iterations from the initial values to find the
solution.

⁵ In this example and hereafter the parameters will always be pre

Figure 14 is the recovery of the same synthetic SE as in
example 1. However, this time the system was given 1000 data
points from one view of the object. Under these conditions, the
recovered parameters were (2.09, .67, 1, 1.94, 3.36, 1.43, 2.53,
4.3, .075, .07, .08). The error (as measured by EOF1) of the
reconstruction was .169. Unsurprisingly, the reconstruction from
multiple views is superior.

The final three examples presented show the fitting of actual
range data from the Utah range database. Figure 15 shows
the elliptical indentation on the bottom of a soda can. The object
is defined by 590 data points. The system recovered the
parameters (1.29, .955, .939, .930, .277, .096, -1.55, 2.07, -.05, 0, 0),
and had EOF1 = .0293. Figure 16 shows a quasi-spherical object
which is defined by 859 data points. The system recovered the
parameters (.994, .951, 1.29, 2.13, 2.13, .4729, 1.437, -1.457, -.05,
-.04, 0) and EOF1 = .021. Figure 17 shows the cylindrical
portion of a soda can defined by 1645 data points. When using the
above described estimation techniques, the system recovered the
parameters (2.0, .88, 1, 1.3, 19.59, -.19, -1.69, -.834, -.04, .05, 0)
with EOF1 = .0079. When the inside-outside function was
modified to include an extra multiplicative factor of a_x a_y a_z, the
result was the bottom object in figure 17.

5  Conclusions  and  Future  Direction 

The main result in this paper is that there are problems and
biases with currently used error-of-fit measures, as well as with
two newly proposed measures. Still, the newly proposed measures
often fared better than currently used measures.

The  analysis  also  suggests  that  even  better  measures  await 
definition,  especially  error  of  fit  measures  which  make  use  of  knowl¬ 
edge  of  a  sensor  error  model  (direction). 

The paper also summarizes results which demonstrate that
even with a poor EOF measure the system can recognize both
positive and negative superquadrics from depth data on only part
of the surface. (The latter is an important consideration if SE’s are
to be used with CSG operations.)

Future plans for the system also include extensions to
incorporate surface derivative information. This will be accomplished
by minimizing a sum with (some variant of) both the inside-outside
function and a differentiated form of the inside-outside function.

Of course, one of the most important avenues for future
research will be attacking the segmentation problem. Our current
plans are to attempt at least two approaches: pure growing of
superquadrics from small data patches, and a skeletonization (to
find the axis) followed by both growing and splitting of superquadric
solids.

A final avenue of research will be in using SE’s for integration
of multi-sensor (possibly multi-modal) information, including
some model of sensing errors. Part of this work will undoubtedly
look into the use of edges in an intensity image/range image as
they relate to limb-equations.

Abstract: In this paper, we study the differential geometry
of straight homogeneous generalized cylinders
(SHGC’s).  We  derive  a  necessary  and  sufficient  condi¬ 
tion  that  a  SHGC  must  verify  to  parameterize  a  regular 
surface,  compute  the  Gaussian  curvature  of  a  regular 
SHGC,  and  prove  that  the  parabolic  lines  of  a  SHGC 
are  either  meridians  or  parallels.  Using  these  results, 
we  address  in  the  second  part  of  the  paper  the  following 
problem:  Under  which  conditions  can  a  given  surface 
have  several  descriptions  by  SHGC’s?  We  prove  several 
new  results.  In  particular,  we  prove  that  two  SHGC’s 
with  the  same  cross-section  plane  and  axis  direction  are 
necessarily  deduced  from  each  other  through  inverse 
scalings  of  their  cross-sections  and  sweeping  rule  curve. 
We  extend  Shafer’s  pivot  and  slant  theorems.  Finally, 
we  prove  that  a  surface  with  at  least  two  parabolic 
lines  has  at  most  three  different  SHGC  descriptions, 
and  that  a  surface  with  at  least  four  parabolic  lines 
has  at  most  a  unique  SHGC  description. 

Introduction 

A  generalized  cylinder  [Binford,  1971]  is  the  solid  ob¬ 
tained  by  sweeping  a  planar  region,  its  cross-section , 
along  a  space  curve,  its  axis,  or  spine.  The  axis  is  not 
necessarily straight, or even planar; the cross-section is
not  necessarily  circular,  or  even  constant;  its  deforma¬ 
tion  is  governed  by  a  sweeping  rule. 

Generalized cylinders have been extensively used to
represent three-dimensional objects in computer vision
[Agin, 1972], [Brooks, 1981], [Nevatia, 1982], [Ponce et
al., 1987], [Binford et al., 1987], [Horaud and Brady,
1987]. The most successful vision system to date using
generalized cylinders as its primary representation
for three-dimensional objects is probably Acronym
[Brooks, 1981]. The generalized cylinders used in
Acronym are very simple (circular or regular polygonal
cross-section, linear or bi-linear sweeping rule). To
develop a vision system using a richer class of generalized
cylinders, it is necessary to understand the geometry
of these primitives, whether the same object may
be described by different primitives or has a unique
description, and how these primitives project into images.

Shafer,  in  his  pioneering  work  [Shafer,  1985],  addressed 
these  three  problems  for  the  case  of  straight  homo¬ 
geneous  generalized  cylinders  (SHGC’s,  see  [Shafer, 
1985]),  obtained  by  scaling  an  arbitrary  cross-section 
along  a  straight  axis.  He  derived  expressions  for  the 
surface  and  the  normal  of  a  SHGC  and  proved  sev¬ 
eral  uniqueness  results  and  properties  of  the  contours 
of  SHGC’s.  This  paper  is  one  in  a  series  of  articles  that 
study  the  geometry  of  straight  homogeneous  general¬ 
ized  cylinders  in  an  effort  to  extend  Shafer’s  work.  In 
[Ponce  and  Chelberg,  1987]  and  [Ponce,  Chelberg,  and 
Mann,  1987],  we  have  studied  the  contours  of  SHGC’s, 
and  proved  new  properties  of  these  contours  which  are 
invariant  with  respect  to  the  viewing  direction. 

In  this  paper,  we  study  the  differential  geometry  of 
SHGC’s,  and  use  our  conclusions  to  derive  new  unique¬ 
ness  results.  In  section  1,  we  prove  that  the  surface  of 
a general SHGC is not always regular (among other
things,  it  does  not  always  have  a  well-defined  tangent 
plane  everywhere),  and  derive  necessary  and  sufficient 
conditions  of  regularity.  We  derive  in  section  2  an  ex¬ 
pression  for  the  Gaussian  curvature  of  a  SHGC  and 
prove  that  the  parabolic  lines  of  a  SHGC  are  always  ei¬ 
ther  meridians  or  parallels.  In  the  following  section,  we 
define,  following  Shafer,  classes  of  equivalent  SHGC’s, 
which  trivially  define  the  same  surface.  We  prove  the 
new  converse  result  that  if  a  surface  is  described  by  two 
SHGC’s  with  the  same  axis  and  cross-section  plane, 


then  these  two  SHGC’s  are  equivalent. 

In section 4, we use the result of section 2 on SHGC’s
parabolic  lines  to  prove  a  simple  uniqueness  result, 
stating  that  a  surface  described  by  a  SHGC  with  two 
parabolic  meridians  and  two  parabolic  parallels  does 
not  have  any  other  non-equivalent  SHGC  description. 
In  section  5,  we  extend  Shafer’s  slant  and  pivot  theo¬ 
rems.  Relaxing  some  of  his  assumptions  (intersecting 
axes,  same  radius  functions),  we  prove  that  for  a  non¬ 
linear  SHGC,  the  direction  of  the  axis  is  determined  by 
the  direction  of  the  cross-section,  and  that  the  direction 
of  the  cross-section  is  determined  by  the  direction  of  the 
axis.  Finally,  in  section  6,  we  give  our  main  unique¬ 
ness  results.  A  surface  with  at  least  two  parabolic  lines 
has  at  most  one  SHGC  description  if  these  lines  are 
parallel,  and  at  most  three  if  they  intersect.  A  closed 
surface  with  at  least  one  concave  point  has  at  most 
three  SHGC  descriptions.  A  surface  with  at  least  four 
parabolic  lines  has  at  most  one  SHGC  description. 

The  proofs  of  our  results  are  in  general  simple,  but 
rather  technical.  The  most  technical  proofs  are  given 
in  the  appendix,  and  only  the  “geometric”  proofs  of 
the  uniqueness  results  are  given  in  the  main  body  of 
the  paper. 

1.  Straight  homogeneous  generalized  cylinders 

In this section, we first give a more precise definition of
straight homogeneous generalized cylinders, and then
study their differential geometry. We introduce the
notion of regular SHGC and give a necessary and sufficient
condition of regularity. Informally, a straight homogeneous
generalized cylinder (SHGC, see [Shafer, 1985])
is obtained by scaling a planar cross-section along a
straight axis. Let us first give a more formal definition.

Definition 1.1: Let x_0 be a point of R^3, P = (i, j, k)
an orthonormal basis of R^3, a = (p, q) a vector of
R^2, and S and C two simple regular planar curves
respectively parameterized by (z, r) : I ⊂ R → R^2 and
(x, y) : J ⊂ R → R^2, where I and J are open intervals
of R. The parameterized surface x : I × J → R^3 defined
in the affine coordinate system (x_0, i, j, k) by

$$ \mathbf{x}(s,t) = r(s) \begin{pmatrix} x(t) \\ y(t) \\ 0 \end{pmatrix} + z(s) \begin{pmatrix} p \\ q \\ 1 \end{pmatrix}, \qquad (1.1) $$

is the straight homogeneous generalized cylinder
associated with the five-tuple (x_0, P, a, S, C). The point x_0
is the origin of the SHGC, P is (the coordinate system
associated with) its cross-section plane, and a is
the direction of its axis. The parameterized curves C
and S are respectively the reference cross-section and
the sweeping rule of the SHGC.

Notice  that  this  formulation  is  very  general.  The  shape 
of  the  reference  cross-section  is  completely  arbitrary. 
Similarly,  the  shape  of  the  sweeping  rule  curve  is  arbi¬ 
trary.  In  particular,  r  is  not  necessarily  a  function  of 
z.  The  axis  is  not  necessarily  orthogonal  to  the  cross- 
section  plane,  the  only  restriction  is  that  it  does  not 
lie  in  this  plane.  Most  authors  have  used  a  more  re¬ 
stricted  definition  of  SHGC’s.  Marr  [1977]  mostly  re¬ 
stricts  his  study  of  SHGC’s  (which  he  calls  generalized 
cones)  to  SHGC’s  with  convex  cross-sections.  In  previ¬ 
ous  papers  [Ponce  and  Chelberg,  1987],  [Ponce,  Chel- 
berg,  and  Mann,  1987],  we  mainly  restricted  our  study 
to  star-shaped  cross-sections.  Shafer  [1985]  considers 
an  arbitrary  cross-section,  but  assumes  as  the  others 
that  the  scaling  is  given  as  a  function  of  z.  This  is 
a restricting assumption, as for example, a torus can
only be described as a SHGC when both r and z are
parameterized.
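The torus remark can be checked numerically with a small sketch of equation (1.1). The code assumes the column vectors in (1.1) are (x(t), y(t), 0) and (p, q, 1), takes the axis orthogonal to the cross-section plane (p = q = 0), and uses illustrative radii; the function names are hypothetical.

```python
import math

def shgc_point(s, t, r, z, x, y, p=0.0, q=0.0):
    """Evaluate x(s, t) = r(s) * (x(t), y(t), 0) + z(s) * (p, q, 1)."""
    rs, zs = r(s), z(s)
    return (rs * x(t) + zs * p, rs * y(t) + zs * q, zs)

R, rho = 2.0, 0.5   # illustrative radii with R > rho, so r(s) stays strictly positive

def torus_point(s, t):
    # Sweeping rule: the circle (z(s), r(s)); here r is not a function of z,
    # since each height z is reached with two different scalings r.
    return shgc_point(s, t,
                      r=lambda u: R + rho * math.cos(u),
                      z=lambda u: rho * math.sin(u),
                      x=math.cos, y=math.sin)

def torus_residual(point):
    """Implicit torus equation; zero for points on the torus."""
    px, py, pz = point
    return (math.hypot(px, py) - R) ** 2 + pz ** 2 - rho ** 2

samples = [torus_point(0.3 * i, 0.7 * j) for i in range(21) for j in range(9)]
worst = max(abs(torus_residual(pt)) for pt in samples)
```

Every sampled point satisfies the implicit torus equation up to rounding, even though no single-valued function r(z) could produce the same surface.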

Let us now examine under which conditions the surface
associated to a SHGC is regular, i.e., in particular,
it is smooth, has no self-intersections, and has
a well-defined tangent plane at each of its points (see
[do Carmo, 1976]). This is important as most concepts
of differential geometry only apply to regular surfaces.
For example, it is well known (see [do Carmo,
1976]) that a solid of revolution, defined as above with
x = cos t, y = sin t, is regular as long as r is strictly
positive. Throughout this paper, we make the
following assumptions (this is similar to Shafer’s
definition of “non-degenerate” SHGC’s).

Assumptions: To avoid degenerate cases, we assume
throughout this paper that: (a.1): r is strictly positive; (a.2):
the sweeping rule curve is not a horizontal line segment
(i.e., z' is not identically 0); and (a.3): the cross-section
curve is not a straight line segment whose supporting
line goes through the origin (i.e., x'y − y'x is
not identically 0).

Under  these  assumptions,  we  have  the  following  result. 

Proposition 1.1: The surface associated with a SHGC is regular
iff at least one of the two following conditions is verified:
(a): $\forall s \in I,\ z'(s) \neq 0$;
(b): $\forall t \in J,\ x'(t)y(t) - y'(t)x(t) \neq 0$.

The detailed proof can be found in the appendix. In
particular, for a regular SHGC, the tangent plane is defined
everywhere, and the (non-normalized) normal n to the surface
at a point x(s, t) is given by

$$\mathbf{n} = \mathbf{x}_t \times \mathbf{x}_s = r r' (x'y - y'x) \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} + r z' \begin{pmatrix} y' \\ -x' \\ x'q - y'p \end{pmatrix}. \qquad (1.2)$$
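
As a sanity check on expression (1.2), the following minimal sketch compares the closed-form normal with a finite-difference cross product. It assumes the parameterization x(s, t) = r(s)(x(t) i + y(t) j) + z(s)(p i + q j + k) with the origin at 0, which is how the appendix uses p and q; the function names and the sample curves are hypothetical choices made only for illustration.

```python
import math

def deriv(f, u, h=1e-5):
    """Central finite-difference estimate of f'(u)."""
    return (f(u + h) - f(u - h)) / (2.0 * h)

def shgc_point(s, t, r, z, x, y, p=0.0, q=0.0):
    """Point x(s, t) of a SHGC with origin 0 and axis direction (p, q, 1) (assumed form)."""
    return (r(s) * x(t) + z(s) * p, r(s) * y(t) + z(s) * q, z(s))

def normal_closed_form(s, t, r, z, x, y, p=0.0, q=0.0):
    """Non-normalized normal x_t x x_s written as in expression (1.2)."""
    rs, dr, dz = r(s), deriv(r, s), deriv(z, s)
    dx, dy = deriv(x, t), deriv(y, t)
    w = dx * y(t) - dy * x(t)  # x'y - y'x
    return (rs * dz * dy, -rs * dz * dx, rs * dr * w + rs * dz * (dx * q - dy * p))

def normal_numeric(s, t, r, z, x, y, p=0.0, q=0.0, h=1e-5):
    """Cross product x_t x x_s with both partial derivatives taken numerically."""
    pt = lambda a, b: shgc_point(a, b, r, z, x, y, p, q)
    xs = [(u - v) / (2.0 * h) for u, v in zip(pt(s + h, t), pt(s - h, t))]
    xt = [(u - v) / (2.0 * h) for u, v in zip(pt(s, t + h), pt(s, t - h))]
    return (xt[1] * xs[2] - xt[2] * xs[1],
            xt[2] * xs[0] - xt[0] * xs[2],
            xt[0] * xs[1] - xt[1] * xs[0])

# Hypothetical oblique SHGC: wavy sweeping rule, elliptic cross-section.
r = lambda s: 1.0 + 0.3 * math.sin(s)
z = lambda s: s
x = lambda t: 2.0 * math.cos(t)
y = lambda t: math.sin(t)
```

Calling both functions at the same (s, t), for instance with p = 0.2 and q = -0.1, should give vectors that agree up to finite-difference error.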

We  prove  in  the  appendix  that  if  both  conditions  (a) 
and  (b)  are  false,  then  the  surface  is  not  regular  because 
there  is  at  least  one  point  where  the  normal  is  not  de¬ 
fined.  In  fact,  it  is  easy  to  show  that  if  z  and  x/y  are 
both  extremal,  then  the  surface  has  self-intersections. 
A  geometric  interpretation  of  the  previous  proposition 
is  given  by  the  following  immediate  corollary. 

Proposition  1.2:  The  surface  associated  to  a  SHGC  is 
regular  iff  the  sweeping  curve  can  globally  be  parameter¬ 
ized  as  a  function  r  of  z,  or  the  reference  cross-section 
can  globally  be  parameterized  in  polar  coordinates  as  a 
function  p  of  0. 

Notice  that  a  closed  cross-section  can  be  parameterized 
in  polar  coordinates  iff  it  is  star-shaped.  There  exist 
open  curves  which  can  be  parameterized  in  polar  coor¬ 
dinates  but  are  such  that  the  line  between  a  point  and 
the  origin  intersects  the  curve  several  times  (e.g.,  an 
exponential  spiral). 
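
The two conditions of Proposition 1.1 can be probed numerically. The sketch below is only a sampling heuristic, not a proof: it assumes the curves are given as Python callables on closed parameter ranges, and the helper names (`never_vanishes`, `looks_regular`) are hypothetical.

```python
import math

def deriv(f, u, h=1e-5):
    """Central finite-difference estimate of f'(u)."""
    return (f(u + h) - f(u - h)) / (2.0 * h)

def never_vanishes(f, lo, hi, n=2000):
    """Heuristic: True if f keeps one strict sign on n + 1 samples of [lo, hi]."""
    vals = [f(lo + (hi - lo) * k / n) for k in range(n + 1)]
    return all(v > 0 for v in vals) or all(v < 0 for v in vals)

def looks_regular(r, z, x, y, s_range, t_range):
    """Sampled test of Proposition 1.1: condition (a) or condition (b)."""
    # r is kept in the signature for symmetry; assumption (a.1), r > 0, is not checked here.
    cond_a = never_vanishes(lambda s: deriv(z, s), *s_range)
    cond_b = never_vanishes(
        lambda t: deriv(x, t) * y(t) - deriv(y, t) * x(t), *t_range)
    return cond_a or cond_b

full_turn = (0.0, 2.0 * math.pi)

# Torus: folding sweeping rule (z' vanishes), star-shaped circular cross-section.
torus = looks_regular(lambda s: 2.0 + math.cos(s), math.sin,
                      math.cos, math.sin, full_turn, full_turn)

# Folding sweeping rule and a circle that does not contain the origin
# (not star-shaped with respect to the origin): both conditions fail.
ill_defined = looks_regular(lambda s: 2.0 + math.cos(s), math.sin,
                            lambda t: 2.0 + math.cos(t), math.sin,
                            full_turn, full_turn)
```

Here `torus` evaluates to True through condition (b), while `ill_defined` evaluates to False, which matches the remark that a folding sweeping rule combined with a non-star-shaped cross-section gives an ill-defined object.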

The above proposition has practical consequences.
Non-star-shaped SHGC's and folding SHGC's (i.e., SHGC's for
which r cannot be written as a function of z) are worth
considering for vision and modelling purposes, as they
describe a wide class of objects. However, SHGC's which have
both a folding sweeping rule curve and a non-star-shaped
cross-section are ill-defined objects, and there is no point
in including them in a vision or a geometric modelling system
(e.g., the one described in [Ponce et al., 1987]).

2.  Intrinsic  geometry  of  SHGC’s 

We  now  turn  to  the  intrinsic  geometry  of  the  sur¬ 
faces  associated  with  straight  homogeneous  generalized 
cylinders.  We  compute  the  Gaussian  curvature  of  a 
regular  SHGC,  and  prove  that  the  parabolic  lines  of  a 
SHGC  are  always  meridians  or  parallels. 

Proposition 2.1: The parabolic lines of a SHGC are its
meridians which correspond to points of its reference
cross-section where the tangent is aligned with the vector
(x, y) or the curvature is zero, and its parallels which
correspond to points of its sweeping rule curve where the
tangent is parallel to the cross-section plane or the
curvature is zero.

The proof of the proposition can be found in the appendix. In
particular, we prove that the Gaussian curvature is

$$K = \frac{r^3}{|\mathbf{n}|^4}\,(z''r' - r''z')(x''y' - y''x')\,z'\,(x'y - y'x). \qquad (2.1)$$

The  proposition  follows  easily  from  this  expression. 
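
The curvature expression can be checked on two familiar surfaces. The following sketch assumes the prefactor r^3/|n|^4 obtained from K = (eg − f^2)/(EG − F^2) with the normal of expression (1.2), and the parameterization x(s, t) = r(s)(x(t) i + y(t) j) + z(s)(p i + q j + k); derivatives are estimated by finite differences, and all names are hypothetical.

```python
import math

def d1(f, u, h=1e-4):
    """Central finite-difference estimate of f'(u)."""
    return (f(u + h) - f(u - h)) / (2.0 * h)

def d2(f, u, h=1e-4):
    """Central finite-difference estimate of f''(u)."""
    return (f(u + h) - 2.0 * f(u) + f(u - h)) / (h * h)

def gaussian_curvature(s, t, r, z, x, y, p=0.0, q=0.0):
    """Gaussian curvature of a SHGC from the closed form (assumed prefactor r^3/|n|^4)."""
    rs, dr, ddr = r(s), d1(r, s), d2(r, s)
    dz, ddz = d1(z, s), d2(z, s)
    dx, ddx = d1(x, t), d2(x, t)
    dy, ddy = d1(y, t), d2(y, t)
    w = dx * y(t) - dy * x(t)                      # x'y - y'x
    n = (rs * dz * dy, -rs * dz * dx,
         rs * dr * w + rs * dz * (dx * q - dy * p))  # normal of (1.2)
    n_sq = sum(c * c for c in n)
    return (rs ** 3 * (ddz * dr - ddr * dz) * (ddx * dy - ddy * dx)
            * dz * w / n_sq ** 2)

# Unit sphere as a right SHGC: r = cos s, z = sin s, circular cross-section.
k_sphere = gaussian_curvature(0.4, 1.3, math.cos, math.sin, math.cos, math.sin)

# Torus with major radius 2 and minor radius 0.5 (a folding sweeping rule).
torus_r = lambda s: 2.0 + 0.5 * math.cos(s)
torus_z = lambda s: 0.5 * math.sin(s)
k_outer = gaussian_curvature(0.4, 0.2, torus_r, torus_z, math.cos, math.sin)
k_inner = gaussian_curvature(2.5, 0.2, torus_r, torus_z, math.cos, math.sin)
```

The sphere value should be close to 1, and the torus values should match the classical cos s / (a (R + a cos s)) with R = 2 and a = 0.5: positive on the outer side, negative on the inner side, with parabolic parallels where z' = 0.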

Notice that a regular SHGC cannot have both parabolic
parallels where z'(s) = 0 and parabolic meridians where
(x'y − y'x)(t) = 0. However, it may have at the same time
parabolic parallels corresponding to zeros of curvature of the
sweeping rule and parabolic meridians corresponding to zeros
of curvature of the reference cross-section. Related work can
be found in [Brady et al., 1985]. They prove that the only
meridians and parallels of a SHGC which are lines of curvature
are the extremal ones, called flutings and skeletons (see also
[Marr, 1977]). As shown in [Ponce, Chelberg, and Mann, 1987],
this result is only valid for right SHGC's.

A first consequence of this expression of the Gaussian
curvature is that the zeros of curvature of the contours of
SHGC's correspond to points where the curvature of the
sweeping rule or the cross-section is 0, or where z'(s) = 0 or
(x'y − y'x)(t) = 0. This is due to a theorem by Koenderink
[1984], which relates the curvature of the contours of an
object to its Gaussian curvature. This result can also be
proved directly. See [Ponce, Chelberg, and Mann, 1987] for
details.

In this paper, the most important consequence of this
proposition is that it can be used to prove that surfaces with
parabolic lines have a (small) finite number of possible SHGC
descriptions (see sections 4 and 6).

3.  Equivalent  SHGC’s 

Assumptions (a.1) to (a.3) were sufficient to define straight
homogeneous generalized cylinders. We are now going to look at
alternate descriptions of objects by SHGC's, and we will need
two additional assumptions.

Assumptions: We suppose from now on that: (a.4): the sweeping
rule curve of a SHGC does not contain horizontal line segments
(i.e., there is no non-empty open interval such that z' is
zero on that interval), and (a.5): the reference cross-section
does not contain line segments going through the origin (i.e.,
there is no non-empty open interval such that x'y − y'x is
zero on that interval).

Notice that for a given regular SHGC, we just need one of
these assumptions, as the other one will be automatically
verified. A straight homogeneous generalized cylinder, as
defined earlier, is a parameterization of a surface and not
the surface itself. In particular, two different SHGC's may
parameterize the same surface. Following Shafer [1985], we now
introduce the notion of equivalent SHGC's.

Definition 3.1: Two SHGC's are said to be equivalent if they
can be deduced from each other through any sequence of the
following transformations: changing the parameterization of
the sweeping rule or the reference cross-section; translating
the origin along the axis by a given constant while
translating z by the opposite of that constant; rotating the
vectors (i, j) around the vector k by a given angle while
rotating (x, y) and a by the opposite angle; changing the
orientation of the cross-section plane, the axis, and the
sweeping rule and cross-section curves; scaling the sweeping
rule by a non-zero constant while scaling the reference
cross-section by the inverse of that constant.

Trivially, two equivalent SHGC's parameterize the same
surface, as already remarked by Shafer. We can in fact prove
the converse result.

Proposition 3.1: The surfaces associated with two SHGC's with
the same (non-oriented) cross-section plane and (non-oriented)
axis are the same iff the two SHGC's are equivalent.

Notice that, trivially, the relation defined by the
equivalence of two SHGC's is an equivalence relation. From now
on, a SHGC will be an equivalence class of this relation,
represented by any member of the class. This allows us to talk
about the cross-section plane, the axis, the meridians and
parallels associated to a SHGC independently of the coordinate
system and orientation chosen to describe them.

4. A simple sufficient uniqueness condition

We now have all the tools to prove a first, simple uniqueness
result. We are going to use the following (immediate)
properties of the parallels and meridians of a SHGC.

Properties: The sweeping rule (respectively reference
cross-section) of a SHGC is not linear iff there exists a
non-linear meridian (respectively parallel). A non-linear
meridian (respectively parallel) uniquely determines a plane
in which it lies. The planes of any two non-linear meridians
intersect on the axis. The planes of a non-linear meridian and
a non-linear parallel intersect.

Proposition 4.1: If a surface admits a parameterization by a
SHGC whose sweeping rule has two zeros of curvature
(corresponding to different values of z) and the cross-section
is not linear, then the parabolic parallels corresponding to
these zeros of curvature are parallels in any other
description of the surface by a SHGC.


Proof of the proposition: The two zeros of curvature of the
sweeping rule correspond to two parabolic parallels $P_1$ and
$P_2$. In any other description, $P_1$ and $P_2$ are either
parallels or meridians. As the cross-section is not linear,
$P_1$ and $P_2$ are themselves not linear and uniquely
determine two parallel planes. $P_1$ and $P_2$ cannot both be
meridians, as they are not linear and their planes do not
intersect (these planes are parallel and do not coincide, as
the zeros of curvature correspond to different values of z).

Suppose that one of them is a parallel and the other one is a
meridian. Then in this description, neither the cross-section
nor the sweeping rule is linear, so the planes of $P_1$ and
$P_2$ should intersect, which contradicts the hypothesis. It
follows that $P_1$ and $P_2$ are parallels in any description. ■

Proposition  4.2:  If  a  surface  admits  a  SHGC  descrip¬ 
tion  such  that  the  sweeping  rule  curve  is  not  linear  and 
has  at  least  two  zeros  of  curvature  (corresponding  to 
different  values  of  z)  and  the  reference  cross-section  is 
not  linear  and  has  at  least  two  (non-diametrically  op¬ 
posed)  zeros  of  curvature,  then  the  surface  admits  no 
other  different  SHGC  description. 

Proof of the proposition: From the previous proposition, the
two parabolic parallels are parallels in any description. The
two parabolic meridians $M_1$ and $M_2$ corresponding to the
zeros of curvature of the reference cross-section are
meridians or parallels in any description. As $M_1$ and $M_2$
are not linear, they determine two planes, which intersect the
planes of the two parabolic parallels. Therefore, $M_1$ and
$M_2$ cannot be parallels and are meridians in any
description. As they are not linear, their planes intersect on
the axis. Both the cross-section plane and the axis are given,
and using Proposition (3.1), the proposition is proved. ■

Notice that the above result is relatively weak. However, a
non-convex closed cross-section has at least two zeros of
curvature. Similarly, if the surface of a SHGC is closed and
its sweeping rule is non-convex, then it also has at least two
zeros of curvature. This fairly general class of surfaces
admits a unique SHGC description. To prove stronger uniqueness
results, we are now going to extend some results due to
Shafer.

5.  Generalized  pivot  and  slant  theorems 

In this section, we prove generalized versions of Shafer's
pivot and slant theorems [Shafer, 1985].

Pivot theorem: A non-degenerate SHGC can be described as
another SHGC with a different axis, and the same cross-section
planes, iff it is linear.

This  theorem  has  been  proved  under  the  assumption 
that  the  two  parameterizations  have  intersecting  axes. 

Slant  theorem:  A  non-degenerate  oblique  SHGC  has 
an  equivalent  right  SHGC  iff  the  radius  function  is  lin¬ 
ear. 

This  theorem  has  been  proved  under  the  assumption 
that  the  same  scaling  function  r  is  to  be  used  for  both 
parameterizations.  It  has  been  extended  by  Roberts 
[1985],  to  relax  this  assumption.  Both  theorems  have 
been  proved  under  the  additional  assumption  that  r  is  a 
function of z. In this section, we are going to give more
general statements, valid for general SHGC's, without
the above restrictions.

Proposition  5.1:  The  surface  associated  with  a  non 
linear  SHGC  cannot  be  described  by  another  SHGC 
with  a  different  axis,  and  the  same  cross-section  plane. 

Proposition  5.2:  The  surface  associated  with  a  non 
linear  SHGC  cannot  be  described  by  another  SHGC 
with  the  same  axis,  and  a  different  cross-section  plane. 

A  detailed  proof  of  these  two  propositions  can  be  found 
in  the  appendix.  The  important  consequence  of  these 
propositions  is  that  for  a  given  surface,  and  a  given 
cross-section  plane  (respectively  a  given  axis),  there  is 
at  most  one  possible  description  by  a  non-linear  SHGC 
(class  of  equivalence). 

6.  Main  uniqueness  results 

Proposition 6.1: A surface with at least two non-linear
parabolic lines has at most a unique description as a SHGC if
these lines are parallel, and at most three different SHGC
descriptions if they are not.

Proof of the proposition: Let us suppose that a surface has
two non-linear parabolic lines $L_1$ and $L_2$. If they are
parallel to each other, then they are parallels in any
description of that surface by a SHGC. Using Proposition
(5.1), this also determines the axis. If $L_1$ and $L_2$ are
not parallel to each other, there are three possible cases. If
$L_1$ and $L_2$ are meridians, the intersection of their
planes determines the axis and therefore also the
cross-sections of any SHGC description (Proposition (5.2)). If
$L_1$ is a parallel and $L_2$ is a meridian, then $L_1$
determines the cross-section plane and therefore the axis
(Proposition (5.1) again). In the remaining case, $L_2$ is a
parallel and $L_1$ is a meridian. $L_2$ determines the
cross-section plane, and a corresponding third potential
description of the surface as a SHGC. ■

A corollary of this result is that:

Proposition 6.2: A closed bounded surface with at least one
concave point has at most one SHGC description.

Proof of the proposition: A closed bounded surface has at
least one convex point, where both principal curvatures are
strictly positive and the Gaussian curvature is strictly
positive (e.g., [do Carmo, pp. 172]). By hypothesis, the
surface also has a concave point, where both principal
curvatures are strictly negative and the Gaussian curvature is
strictly positive. The surface has therefore two parabolic
lines, separating one hyperbolic region and two elliptic
regions. The result follows. ■

Proposition 6.3: A surface with at least four non-linear
parabolic lines has at most one SHGC description.

Proof of the proposition: Suppose now that the surface has at
least four non-linear parabolic lines. If at least two of
these lines lie in parallel planes, we have just proved that
the surface has at most one SHGC parameterization. In the
remaining case, suppose that the surface can be parameterized
by a SHGC. In this description, at most one of the lines is a
parallel, and the planes of the three other lines $L_1$,
$L_2$, $L_3$ intersect on the axis.

Suppose then that the surface admits another parameterization
by a SHGC. In this description, at most one of the $L_i$'s can
be a parallel, as the planes of any two of these lines
intersect. The two other lines are meridians, and the
intersection of their planes gives the axis of the second
description. This axis is the same as in the first
description, and the result is proved. ■

In particular, consider the case of an ellipsoid of
revolution. If we create one concave point by pressing on one
of its ends, but keeping it a solid of revolution, the above
results show that the resulting object has a unique SHGC
description, although the figure is still radially symmetric.

Conclusion

We have proved new results concerning the differential
geometry of straight homogeneous generalized cylinders, and
used them to derive new uniqueness results for these objects.
We plan to use similar methods to derive similar results for a
wider class of generalized cylinders. Some problems that we
will address are: When is the surface of a generalized
cylinder with a curved axis and a circular cross-section
regular? Can we prove uniqueness results for these primitives?
We already know for example that a torus can be described by a
SHGC, but also by a generalized cylinder with a circular axis
and a constant circular cross-section. Can it be described by
other generalized cylinders? Another interesting problem is,
given the model of an object, to try to determine (when
possible) not only whether it may have alternate
representations, but also how many, and maybe generate these
representations automatically.

7. Appendix: proofs of the propositions

In this section, we give the proofs of propositions (1.1),
(2.1), (3.1), (5.1), and (5.2). We use some lemmas, whose
proofs are simple but omitted for the sake of conciseness.

Proposition 1.1: The surface associated with a SHGC is regular
iff one of the two following conditions is verified:
(a): $\forall s \in I,\ z'(s) \neq 0$;
(b): $\forall t \in J,\ x'(t)y(t) - y'(t)x(t) \neq 0$.
To  prove  this  proposition,  we  will  need  the  following 
lemma. 

Lemma 7.1: Let $f : I \to \mathbb{R}$ be a continuous real
function defined on the open interval $I \subset \mathbb{R}$.
If f is not identically 0 on I, and has at least one zero on
I, then there exists a point $t_0 \in I$ such that
$f(t_0) = 0$ and that there is no open interval J included in
I and containing $t_0$ such that, for all $t \in J$,
$f(t) = 0$.

Proof of the proposition: To prove that the surface associated
with a SHGC satisfying (a) or (b) is regular, it is sufficient
to prove (see [do Carmo, 1976, pp. 52]) that: (1): the
function x is differentiable, (2): it is a homeomorphism
(i.e., it is continuous and invertible, and its inverse is
continuous), and (3): $\forall (s,t) \in I \times J$,
$\mathbf{x}_t \times \mathbf{x}_s \neq 0$.

Let us suppose that (a) or (b) is satisfied. (1) is trivial.
We now prove (2). From now on, we drop the arguments s and t.
x is clearly continuous and surjective. To prove that it is
injective and that its inverse is continuous, let us consider
a point $(X, Y, Z)^t$ on the surface, and a point (s, t) of
$I \times J$ such that $\mathbf{x}(s,t) = (X, Y, Z)^t$.

Let us suppose first that the condition (a) is verified. In
this case, the function z is invertible on I, and we have
$s = z^{-1}(Z)$, where $z^{-1}$ is continuous (this is a
consequence of the inverse function theorem, e.g., see [Rudin,
1976, pp. 221]).

We deduce that $x(t) = (X - Zp)/r \circ z^{-1}(Z)$ and
$y(t) = (Y - Zq)/r \circ z^{-1}(Z)$, where "$\circ$" denotes
the composition of two functions. As
$(x, y) : J \to \mathbb{R}^2$ is a parameterization of the
reference cross-section, t is a continuous function of
(X, Y, Z).

We now consider the case where (b) is verified. In particular,
x(t) and y(t) cannot both be zero at the same t. At a point
where $Y - Zq = r(s)y(t) \neq 0$, we get
$(x/y)(t) = (X - Zp)/(Y - Zq)$.

As (b) is verified, x/y is invertible in a neighborhood of
this point and admits a continuous inverse (again a
consequence of the inverse function theorem). We get
$t = (x/y)^{-1} \circ ((X - Zp)/(Y - Zq))$. The same reasoning
holds in the neighborhood of a point where
$X - Zp = r(s)x(t) \neq 0$, so t is a continuous function of
(X, Y, Z).

We have $r(s) = (X - Zp)/x \circ t^{-1}(X, Y, Z)$ and
$z(s) = Z$. As $(z, r) : I \to \mathbb{R}^2$ is a
parameterization of the sweeping rule curve, s is a continuous
function of (X, Y, Z). This completes the proof that x is a
homeomorphism.

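We now prove (3), i.e., $\mathbf{x}_t \times \mathbf{x}_s \neq 0$.
The partial derivatives of x with respect to s and t are
respectively given by

$$\mathbf{x}_s = r' \begin{pmatrix} x \\ y \\ 0 \end{pmatrix} + z' \begin{pmatrix} p \\ q \\ 1 \end{pmatrix}, \qquad \mathbf{x}_t = r \begin{pmatrix} x' \\ y' \\ 0 \end{pmatrix}. \qquad (7.1)$$

Their cross product is given by

$$\mathbf{x}_t \times \mathbf{x}_s = r r' \begin{pmatrix} 0 \\ 0 \\ x'y - y'x \end{pmatrix} + r z' \begin{pmatrix} y' \\ -x' \\ x'q - y'p \end{pmatrix}. \qquad (7.2)$$

From the previous expression and the fact that C is regular,
so that for all $t \in J$, $x'^2 + y'^2 \neq 0$, we deduce
immediately that (3) is satisfied.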

We have proved that if (a) or (b) is satisfied, then the
surface is regular. We now prove the converse proposition. Let
us suppose that neither (a) nor (b) is satisfied, i.e., there
exists $(u_0, v_0) \in I \times J$ such that $z'(u_0) = 0$ and
$(x'y - y'x)(v_0) = 0$.

As, from assumption (a.2), z' is not identically zero, it
follows from lemma (7.1) that there is a point $s_0 \in I$
such that $z'(s_0) = 0$ and there is no open interval
containing $s_0$ such that z' is identically 0 on this
interval. In particular, it follows that for all
$\epsilon > 0$, there exists $s_\epsilon \in I$ such that
$|s_\epsilon - s_0| < \epsilon$ and
$z'(s_\epsilon) \neq 0$. Let us define the sequence
$(s_n)_{n>0}$ by $s_n = s_\epsilon$ with $\epsilon = 1/n$. The
sequence $(s_n)_{n>0}$ converges towards $s_0$.

Similarly, using assumption (a.3) and lemma (7.1), we can
define a point $t_0 \in J$ and a sequence $(t_n)_{n>0}$ such
that $(x'y - y'x)(t_0) = 0$, for all $n > 0$,
$(x'y - y'x)(t_n) \neq 0$, and the sequence $(t_n)_{n>0}$
converges towards $t_0$.

At a point $(s_n, t_0)$, the normal is parallel to the
constant vector $(y', -x', x'q - y'p)^t$ evaluated at $t_0$.
At a point $(s_0, t_n)$, the normal is parallel to the
constant vector $(0, 0, 1)^t$.

If the surface is regular, then it is locally orientable in
$\mathbf{x}(s_0, t_0)$, and the normal is well-defined and
continuous in a neighborhood of this point (e.g., see [do
Carmo, 1976, pp. 136]). In particular the sequences
$(\mathbf{n}(s_n, t_0))_{n>0}$ and
$(\mathbf{n}(s_0, t_n))_{n>0}$ (where n(s, t) denotes the
normal to the surface at the point x(s, t)) should converge
toward a same direction. But these sequences are constant and
parallel to independent vectors. This contradicts the
hypothesis, and the result is proved. ■

Proposition 2.1: The parabolic lines of a SHGC are its
meridians which correspond to points of its reference
cross-section where the tangent is aligned with the vector
(x, y) or the curvature is zero, and its parallels which
correspond to points of its sweeping rule curve where the
tangent is parallel to the cross-section plane or the
curvature is zero.

Proof of the proposition: We compute the Gaussian curvature of
a SHGC. The second partial derivatives of x are given by

$$\mathbf{x}_{ss} = r'' \begin{pmatrix} x \\ y \\ 0 \end{pmatrix} + z'' \begin{pmatrix} p \\ q \\ 1 \end{pmatrix}, \qquad \mathbf{x}_{st} = r' \begin{pmatrix} x' \\ y' \\ 0 \end{pmatrix}, \qquad \mathbf{x}_{tt} = r \begin{pmatrix} x'' \\ y'' \\ 0 \end{pmatrix}. \qquad (7.3)$$


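The Gaussian curvature of the surface is given by

$$K = \frac{eg - f^2}{EG - F^2}, \qquad (7.4)$$

where E, F, G and e, f, g are respectively the coefficients of
the first and second fundamental forms in the basis
$(\mathbf{x}_s, \mathbf{x}_t)$.

It is well known (e.g., [do Carmo, 1976, pp. 98]) that
$EG - F^2 = |\mathbf{x}_t \times \mathbf{x}_s|^2$. Let us
denote the vector $\mathbf{x}_t \times \mathbf{x}_s$ by n. n
is the (non-normalized) normal to the surface. The
coefficients e, f, g are given by the following formulas

$$e = \frac{1}{|\mathbf{n}|}\,\mathbf{n} \cdot \mathbf{x}_{ss} = \frac{1}{|\mathbf{n}|}\, r\,(z''r' - r''z')(x'y - y'x),$$
$$f = \frac{1}{|\mathbf{n}|}\,\mathbf{n} \cdot \mathbf{x}_{st} = 0,$$
$$g = \frac{1}{|\mathbf{n}|}\,\mathbf{n} \cdot \mathbf{x}_{tt} = \frac{1}{|\mathbf{n}|}\, r^2 z' (x''y' - y''x'). \qquad (7.5)$$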


So K is given by

$$K = \frac{r^3}{|\mathbf{n}|^4}\,(z''r' - r''z')(x''y' - y''x')\,z'\,(x'y - y'x). \qquad (7.6)$$

But the curvature of a planar curve
$(u, v) : I \to \mathbb{R}^2$ is given by

$$\kappa = \frac{u'v'' - u''v'}{(u'^2 + v'^2)^{3/2}}. \qquad (7.7)$$

The proposition therefore follows from the above expression of
K. ■
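
The planar curvature formula (7.7) is what turns the factors of K into statements about zeros of curvature of the sweeping rule and of the cross-section. A minimal numerical sketch follows, assuming the usual signed-curvature convention and finite-difference derivatives; the function name and test curves are hypothetical.

```python
import math

def d1(f, u, h=1e-4):
    """Central finite-difference estimate of f'(u)."""
    return (f(u + h) - f(u - h)) / (2.0 * h)

def d2(f, u, h=1e-4):
    """Central finite-difference estimate of f''(u)."""
    return (f(u + h) - 2.0 * f(u) + f(u - h)) / (h * h)

def planar_curvature(u, v, t):
    """Signed curvature (u'v'' - u''v') / (u'^2 + v'^2)^(3/2) of the curve (u, v) at t."""
    du, dv = d1(u, t), d1(v, t)
    ddu, ddv = d2(u, t), d2(v, t)
    return (du * ddv - ddu * dv) / (du * du + dv * dv) ** 1.5

# A circle of radius 2 has constant curvature 1/2.
k_circle = planar_curvature(lambda t: 2.0 * math.cos(t),
                            lambda t: 2.0 * math.sin(t), 0.9)

# Sweeping rule (z, r) = (s, 1 + 0.3 sin s): an inflection at s = pi, which
# would give a parabolic parallel of the corresponding SHGC.
k_inflection = planar_curvature(lambda s: s,
                                lambda s: 1.0 + 0.3 * math.sin(s), math.pi)
```

With these inputs, `k_circle` should be close to 0.5 and `k_inflection` close to 0.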

We are now going to prove the propositions which relate two
descriptions of a same surface by different SHGC's (equivalent
SHGC's, pivot and slant theorems). We now present the general
idea of the proof, which is the same for the three
propositions, then give without proof three lemmas, and
finally give the detailed proofs of the propositions.

Two SHGC's $(\mathbf{x}_{01}, P_1, (p_1, q_1), S_1, C_1)$ and
$(\mathbf{x}_{02}, P_2, (p_2, q_2), S_2, C_2)$ parameterize
the same surface iff there exists a bijection
$h : (s_1, t_1) \in I_1 \times J_1 \to (s_2, t_2) \in I_2 \times J_2$
such that, for all $(s_1, t_1) \in I_1 \times J_1$, we have
$\mathbf{x}_1(s_1, t_1) = \mathbf{x}_2 \circ h(s_1, t_1)$,
where "$\circ$" denotes the operator of composition of two
mappings.

As the surface associated to a SHGC is regular, h is a
diffeomorphism (i.e., it is differentiable, invertible, and
its inverse is differentiable, see [do Carmo, 1976, pp. 70]).
To prove our results, we will use this result and the fact
that for all $(s_1, t_1) \in I_1 \times J_1$, we have
$\mathbf{x}_2 \circ h(s_1, t_1) = \mathbf{x}_1(s_1, t_1)$,
and, as the tangent plane of a regular surface is independent
of the parameterization, we have
$\mathbf{n}_1(s_1, t_1) \times \mathbf{n}_2 \circ h(s_1, t_1) = 0$
(where $\mathbf{n}_i$ denotes the normal to the ith SHGC).

In our proofs, we will often have to prove that $s_2$
(respectively $t_2$) is a function of $s_1$ (respectively
$t_1$) only, which is not a priori evident, and in fact would
be false for two arbitrary parameterizations of arbitrary
regular surfaces. The following two lemmas will be useful for
that purpose.

Lemma 7.2: Let I and J be two open intervals of
$\mathbb{R}$ and let $f : I \subset \mathbb{R} \to \mathbb{R}$
and $g : J \subset \mathbb{R} \to \mathbb{R}$ be two real
differentiable functions defined on these intervals. Let K be
a third open interval of $\mathbb{R}$ and
$h : I \times K \to J$ be a differentiable function from
$I \times K$ into J such that

$$\forall (s, t) \in I \times K, \quad g \circ h(s, t) = f(s). \qquad (7.8)$$

Then, if there is no non-empty open interval such that f is
constant on that interval, h is a function of s only.


Lemma 7.3: Let I and J be two open intervals of
$\mathbb{R}$ and let
$(x_1, y_1) : I \subset \mathbb{R} \to \mathbb{R}^2$ and
$(x_2, y_2) : J \subset \mathbb{R} \to \mathbb{R}^2$ be two
differentiable simple curves defined on these intervals. Let K
be a third open interval of $\mathbb{R}$ and
$h : I \times K \to J$ be a differentiable function from
$I \times K$ into J such that

$$\forall (s, t) \in I \times K, \quad x_1(s)\, y_2 \circ h(s, t) - y_1(s)\, x_2 \circ h(s, t) = 0. \qquad (7.9)$$

Then, if there is no non-empty open interval such that
$x_1'(s) y_1(s) - y_1'(s) x_1(s)$ is zero on that interval, h
is a function of s only.

The following lemma will be useful to characterize the shape
of SHGC's parameterizing the same surface.

Lemma 7.4: Let I and J be two open intervals of
$\mathbb{R}$. Let $f_1 : I \subset \mathbb{R} \to \mathbb{R}$
and $f_2 : I \subset \mathbb{R} \to \mathbb{R}$ be two real
functions defined on I, and
$g_1 : J \subset \mathbb{R} \to \mathbb{R}$ and
$g_2 : J \subset \mathbb{R} \to \mathbb{R}$ be two real
functions defined on J, such that

$$\forall (s, t) \in I \times J, \quad f_1(s) g_1(t) - f_2(s) g_2(t) = 0. \qquad (7.10)$$

Then, if neither $f_1$ nor $g_1$ is identically 0, there
exists $K \neq 0$ such that for all $s \in I$,
$f_1(s) = K f_2(s)$, and for all $t \in J$,
$g_2(t) = K g_1(t)$.
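
The proofs of the lemmas are omitted in the text. For lemma (7.4), one possible argument, given here only as a sketch and not as the authors' own proof, fixes a point where neither $f_1$ nor $g_1$ vanishes:

```latex
% Sketch of a proof of lemma (7.4); s_0 and t_0 are auxiliary points
% introduced here for the argument.
\begin{align*}
&\text{Choose } s_0 \in I,\ t_0 \in J \text{ with } f_1(s_0) \neq 0,\ g_1(t_0) \neq 0.\\
&\text{By (7.10): } f_2(s_0)\, g_2(t_0) = f_1(s_0)\, g_1(t_0) \neq 0,
  \text{ so } f_2(s_0) \neq 0 \text{ and } g_2(t_0) \neq 0.\\
&\text{Set } K = f_1(s_0)/f_2(s_0) \neq 0.\\
&\text{(7.10) at } s = s_0:\quad
  g_2(t) = \frac{f_1(s_0)}{f_2(s_0)}\, g_1(t) = K\, g_1(t)
  \quad \forall t \in J.\\
&\text{(7.10) at } t = t_0:\quad
  f_1(s)\, g_1(t_0) = f_2(s)\, g_2(t_0) = K f_2(s)\, g_1(t_0)
  \ \Longrightarrow\ f_1(s) = K f_2(s) \quad \forall s \in I.
\end{align*}
```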

We  now  have  all  the  tools  necessary  to  prove  our  three 
propositions. 

Proposition 3.1: The surfaces associated with two SHGC's with
the same (non-oriented) cross-section plane and (non-oriented)
axis are the same iff the two SHGC's are equivalent.


Proof of the proposition: Trivially, the (non-oriented) axis
and (non-oriented) cross-section plane, and the surface
associated with a SHGC are invariant through any of the
transformations which relate two equivalent SHGC's. This
proves the "if" part of the proposition.
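
The invariance claimed in the "if" part can be illustrated for the last transformation of Definition 3.1 (inverse scalings of the sweeping rule and the cross-section). The sketch below assumes the parameterization x(s, t) = r(s)(x(t) i + y(t) j) + z(s)(p i + q j + k) with the origin at 0; the helper names and sample curves are hypothetical.

```python
import math

def shgc_point(s, t, r, z, x, y, p=0.0, q=0.0):
    """Point x(s, t) of a SHGC with origin 0 and axis direction (p, q, 1) (assumed form)."""
    return (r(s) * x(t) + z(s) * p, r(s) * y(t) + z(s) * q, z(s))

def rescale(r, x, y, c):
    """Scale the sweeping rule by c and the reference cross-section by 1/c (c != 0)."""
    return (lambda s: c * r(s), lambda t: x(t) / c, lambda t: y(t) / c)

def max_gap(a, b, z, p, q, n=20):
    """Largest coordinate difference between two SHGC's on a shared (s, t) grid."""
    (ra, xa, ya), (rb, xb, yb) = a, b
    gap = 0.0
    for i in range(n):
        for j in range(n):
            s, t = 0.1 + 0.2 * i, 2.0 * math.pi * j / n
            pa = shgc_point(s, t, ra, z, xa, ya, p, q)
            pb = shgc_point(s, t, rb, z, xb, yb, p, q)
            gap = max(gap, max(abs(u - v) for u, v in zip(pa, pb)))
    return gap

# Hypothetical oblique SHGC with an elliptic cross-section.
original = (lambda s: 1.0 + 0.3 * math.sin(s),
            lambda t: 2.0 * math.cos(t),
            lambda t: math.sin(t))
scaled = rescale(*original, c=2.5)
gap = max_gap(original, scaled, lambda s: s, 0.2, -0.1)
```

The two parameterizations map each (s, t) to the same point up to rounding, so `gap` stays at the level of floating-point error.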

Reciprocally, suppose that the same surface is associated to
two SHGC's with the same (non-oriented) axis and cross-section
plane. It is clear then that, through one or several of the
first four of the equivalence transformations, the orientation
of the plane and axis, the origin $\mathbf{x}_0$, and the
vectors (i, j) and a associated with the two SHGC's can be
made to coincide.

We have only to prove that in that case, the reference
cross-section and the sweeping rule associated with the two
SHGC's are obtained through two inverse constant scalings.

bet  us  consider  two  SHGC’s  (xo,  P,(p,q),Si,Ci)  and 
( x,).  P.(]>.  C’s)  with  the  same  origin  xq,  plane  P 
and  axis  direct  ion  a  The  surfaces  associated  with  these 
two  SHGC  s  are  the  same  iff  there  exists  a  Injection 
h  (*i .  I\ )  €  / 1  x  7 1  —  (s-_>,  h)  £  1 2  x  ,/2  such  that,  for 
any  (sq./i)  £  /(  x  7t  and  (s2,  b>)  =  li(sl,t\),  we  have 


Proposition  5.1: The  surface  associated  with  a  non 
linear  SHGC  cannot  he  described  by  another  SUCH 
with  a  different  axis,  and  the  same  cross-sect  ion  plane. 

Proof  of  the  proposition:  Let  us  consider  two 
SHGC’s  Um.P. (/>..< /,)•>•,  •  <", ) 

and  (xo2.  P,  (p->,  (’•_>)  with  the  same  cross-sect  ion 

plane  P. 

(x 02  -  x o  i  \ 

1/02  -  //111 
-02  -  -01  / 

/  r2x 2  -  r,j-i  \  /  :■,}>■,  -  \ 

+  (  r2j/2  —  rl!/l  I  +  [  C 2 </2  —  ~  1  <7 1  I  (7.11) 

We  can  suppose  without  loss  of  generality  that  x01  and 
X02  have  the  same  k  coordinate  ;0  (by  translating  if 
necessary  one  of  the  origins  along  the  corresponding 
axis).  In  particular,  this  implies  that  for  all  {s\,t\)  6 
/)  x  J|,  we  have  c  i  ( sq  )  =  z-j(s z)- 

We  have  seen  earlier  that  this  implies  that  ,s2  is  a  func¬ 
tion  of  Sj  only,  and  we  can  from  now  on  assume  wit  hout 
loss  of  generality  that  r2  and  z->  are  parameterized  by 

«i  • 

Similarly,  we  now  prove  that  t ■>  is  a  function  of  t]  only 
The  k  component  n.  of  11 1  x  tij  is 


bsmg  the  k  coordinate  of  the  previous  equation,  we 
can  eliminate  :2  —  ci  =0  from  its  i  and  j  coordinates. 
Cross-multiplying  these  coordinates,  and  using  the  fact 
that  r,  and  r2  are  strictly  positive,  we  finally  get  the 
system  of  equal  ions 

x2,Vi  -  1/2 x  1  =  0.  (7.12) 

-2  =  -'i-  (7.13) 

As remarked earlier, h is a diffeomorphism, and is in
particular differentiable. Clearly, the two above equations
satisfy the conditions of lemmas (7.2) and (7.3),
and we conclude that s2 is a function of s1 only and
that t2 is a function of t1 only.

We can now suppose without loss of generality that
both the SHGC's are parameterized by the same parameters
(s1, t1). From the above equations, we have
in particular that, for all (s1, t1) ∈ I1 × J1, r1 x1 = r2 x2
and r1 y1 = r2 y2.

This time, the hypotheses of lemma (7.4) are clearly
satisfied, and we conclude that there exists a non-zero
constant K such that r1 = K r2, x2 = K x1, and y2 = K y1.
This completes the proof of the proposition. ■


=  -(y^hWiVt)  -  J'2('2).'/i(fi  ))(i'i  ’'2-'!c,)  = 

=  -(tfj(/2)*',«i)  -  ^(Of/KMMtV 2--;-).  (7.1a) 

As by hypothesis, z1' is not identically 0, we can restrict
ourselves to an interval I where z1' is nowhere 0.
As r1 and r2 are never zero, we can use on that interval
lemma (7.3) (its conditions are once again clearly
satisfied). It follows that t2 is a function of t1 only.

To complete the proof, we are going to use again the
fact that x2 − x1 is a constant (zero) vector. As t2 is
a function of t1 only, and s2 is a function of s1 only,
we can suppose without loss of generality that both
SHGC's are parameterized by (s1, t1). In the coordinate
system P, we have


In particular, we have 0 = X_t = x2' r2 − x1' r1 and
0 = Y_t = y2' r2 − y1' r1. We deduce as before that
r1 = K r2, x2' = K x1', y2' = K y1' for some (non-zero)
constant K.

Substituting in X and Y, we get that X = (x02 − x01) +
a r2 + z(p2 − p1) and Y = (y02 − y01) + b r2 + z(q2 − q1)
for some constants a and b. As both X and Y must be
0, and by hypothesis (z, r2) is not linear, it follows that
a = b = 0, p1 = p2, q1 = q2, x01 = x02, and y01 = y02.
This completes the proof of the proposition. ■

Proposition 5.2: The surface associated with a non
linear SHGC cannot be described by another SHGC
with the same axis, and a different cross-section plane.

Proof of the proposition: Let us consider two
SHGC's with the same axis. We can suppose without
loss of generality that their origins coincide (by translating
if necessary the origin of one of them along the
axis). We can also suppose without loss of generality
that the vectors i of the associated coordinate systems
P1 and P2 are the same (by rotating if necessary the
two coordinate systems around their k vectors).

The two SHGC's are (x0, P1 = (i, j1, k1), a, S1, C1) and
(x0, P2 = (i, j2, k2), a, S2, C2). Notice that a does not
have the same coordinates in the planes P1 and P2. Let
(p, q) be its coordinates in P1. Let α be the angle of
the rotation around i which maps P1 onto P2; we have

$$\mathbf{j}_2 = \cos\alpha\,\mathbf{j}_1 + \sin\alpha\,\mathbf{k}_1; \qquad \mathbf{k}_2 = -\sin\alpha\,\mathbf{j}_1 + \cos\alpha\,\mathbf{k}_1. \tag{7.17}$$

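In particular, the coordinates of a point x2(s2, t2) of the
second SHGC in the coordinate system (x0, i, j1, k1) are

$$\mathbf{x}_2(s_2, t_2) = r_2 \begin{pmatrix} x_2 \\ y_2 \cos\alpha \\ y_2 \sin\alpha \end{pmatrix} + z_2 \begin{pmatrix} p \\ q \\ 1 \end{pmatrix}. \tag{7.18}$$

The coordinates of n2 in P1 are given by

$$\mathbf{n}_2(s_2, t_2) = r_2 r_2' (x_2' y_2 - y_2' x_2) \begin{pmatrix} 0 \\ -\sin\alpha \\ \cos\alpha \end{pmatrix} + r_2 z_2' \begin{pmatrix} y_2'(\cos\alpha - q \sin\alpha) \\ -x_2' + y_2' p \sin\alpha \\ x_2' q - y_2' p \cos\alpha \end{pmatrix}. \tag{7.19}$$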


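Writing x2 = x1, we get

$$r_2 \begin{pmatrix} x_2 \\ y_2 \cos\alpha \\ y_2 \sin\alpha \end{pmatrix} - r_1 \begin{pmatrix} x_1 \\ y_1 \\ 0 \end{pmatrix} = -(z_2 - z_1) \begin{pmatrix} p \\ q \\ 1 \end{pmatrix}. \tag{7.20}$$

Eliminating z2 − z1 by using the k1 coordinate of this
equation, and cross-multiplying its i and j1 coordinates,
we get (as r1 and r2 are strictly positive)

$$\forall (s_1, t_1) \in I_1 \times J_1, \quad x_1(t_1)\, y_2(t_2)(\cos\alpha - q \sin\alpha) - y_1(t_1)\bigl(x_2(t_2) - y_2(t_2)\, p \sin\alpha\bigr) = 0. \tag{7.21}$$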

Let Y2 = y2(cos α − q sin α) and X2 = x2 − y2 p sin α;
we have X2' Y2 − Y2' X2 = (cos α − q sin α)(x2' y2 − y2' x2).
The quantity (cos α − q sin α) is equal to a · k2, and is
therefore different from zero (the axis of a SHGC is not
in the cross section plane of this SHGC).

The hypotheses of lemma (7.3) are therefore satisfied,
and we conclude that t1 is a function of t2 only. We
now write that n1 × n2 = 0. In particular, writing
that the k1 component of this cross product is zero, we
obtain that

$$z_1' \Bigl( r_2' \bigl( y_1' \sin\alpha\, (x_2' y_2 - x_2 y_2') \bigr) - z_2' \bigl( x_1' y_2' (\cos\alpha - q \sin\alpha) - y_1' (x_2' - y_2' p \sin\alpha) \bigr) \Bigr) = 0. \tag{7.22}$$

Let (s0, t0) be a point where z1' ∘ s1(s0, t0) ≠ 0 (such
a point exists as z1' is not identically 0, and h is a
bijection), and I0 × J0 an open rectangle centered at
(s0, t0) such that for all (s2, t2) ∈ I0 × J0, z1' ∘ s1(s2, t2) ≠
0 (this rectangle exists because z1' ∘ s1 is continuous).
We then have

$$\forall (s_2, t_2) \in I_0 \times J_0, \quad r_2' \bigl( y_1' \sin\alpha\, (x_2' y_2 - x_2 y_2') \bigr) - z_2' \bigl( x_1' y_2' (\cos\alpha - q \sin\alpha) - y_1' (x_2' - y_2' p \sin\alpha) \bigr) = 0. \tag{7.23}$$

Let us suppose that sin α ≠ 0. The hypotheses of
lemma (7.4) are then trivially satisfied with f1 = r2',
g1 = y1' sin α (x2' y2 − x2 y2'), f2 = z2', and
g2 = x1' y2'(cos α − q sin α) − y1'(x2' − y2' p sin α).

In particular this implies that (r2, z2) is linear on
I0, which contradicts the hypothesis. It follows that
sin α = 0, which completes the proof of the proposition. ■
Abstract

Following  Rosenfeld  [Rosenfeld  1986],  we  compare  in 
this  short  paper  Blum,  Brooks,  and  Brady  ribbons.  We 
prove  that  Blum  and  Brady  ribbons  are  not,  in  general, 
Brooks  ribbons.  Conversely,  we  prove  that  Brooks  rib¬ 
bons  are,  in  general,  neither  Blum  nor  Brady  ribbons. 
For Blum and Brady ribbons, it is in principle trivial
to decide whether two contour points may form a ribbon
pair: they have to form a local symmetry, i.e., the
angles between the line joining these two points and the
outward normals to the contour at these points must be
the same. This property is not true for Brooks ribbons.
Is  it  then  possible  to  characterize  locally  the  pairs  of 
contour  points  which  form  a  Brooks  ribbon  pair?  Us¬ 
ing  the  curvature  of  the  contour  of  a  Brooks  ribbon, 
we  prove  that  the  answer  to  this  question  is  yes  for 
some  classes  of  Brooks  ribbons,  including  skewed  sym¬ 
metries.  This  result  has  potential  practical  applications 
for  three-dimensional  shape  recovery  from  image  con¬ 
tours  [Kanade,  1981]. 

Introduction 

Ribbons  are  planar  shapes  generated  by  sweeping  a 
geometric  figure  along  a  plane  curve.  Rosenfeld  [Rosen¬ 
feld,  1986],  has  compared  three  popular  classes  of  rib¬ 
bons:  Blum  [Blum,  1967],  Brooks  [Brooks,  1981],  and 
Brady  [Brady  and  Asada,  1984]  ribbons. 

Rosenfeld  has  studied  the  properties  of  these  ribbons 
from  the  standpoints  of  both  generation  and  recovery, 
and  compared  their  relative  advantages  and  disadvan¬ 
tages.  He  has  also  proved  that  there  are  Brady  rib¬ 
bons  that  are  not  Brooks  or  Blum  ribbons  and  that, 
for  straight  spines,  ignoring  the  ends,  every  Blum  rib¬ 
bon  is  a  Brooks  ribbon,  and  every  Brooks  ribbon  is  a 
Brady ribbon.

Support for this work was provided in part by the Air Force
Office  of  Scientific  Research  under  contract  F33615-85-C-S10U 
and  by  the  Advanced  Research  Projects  Agency  of  the  Depart¬ 
ment  of  Defense  under  Knowledge  Based  Vision  ADS  subcontract 
S' 1 093  S-l. 


Figure  1.  A  Blum  ribbon,  generated  by  sweeping  a  disc 
along  a  plane  curve.  In  this  figure,  the  boundary  of  the 
ribbon  has  been  computed  by  using  the  expression  for  the 
envelope  given  in  section  2. 

In this paper, we consider Blum, Brooks, and Brady
ribbons with curved spines. We prove that, in this
case, Blum ribbons are not, in general, Brooks ribbons,
and that Brooks ribbons are not, in general, Brady
ribbons. As Blum ribbons are always Brady ribbons,
it follows immediately that Brady ribbons are not, in
general, Brooks ribbons, and that Brooks ribbons are
not, in general, Blum ribbons.

In  particular,  this  implies  that  we  cannot  use  seg¬ 
mentation  algorithms  developed  for  finding  Blum  and 
Brady  ribbons  (e.g.,  [Bookstein,  1979],  [Serra,  1982], 
[Brady  and  Asada,  1984])  to  segment  images  contain¬ 
ing  Brooks  ribbons.  However,  certain  classes  of  Brooks 
ribbons,  e.g.,  skewed  symmetries,  are  of  practical  im¬ 
portance,  as  they  can  be  used  for  three-dimensional 
shape  recovery  from  image  contours  [Kanade,  1981]. 

A particular class of algorithms [Brady and Asada,
1984] for finding Blum and Brady ribbons in images is
based on the fact that two points on the boundary of
such a ribbon that correspond to the same axis point
form a local symmetry. Such ribbon pairs can be found


Figure  2.  A  Brooks  ribbon,  generated  by  sweeping  a  line 
segment  along  a  plane  curve.  The  angle  between  this  line 
segment  and  the  tangent  to  the  curve  is  constant. 

by testing all possible pairs of contour points for local
symmetry, and these ribbon pairs can in turn be
grouped into ribbons.
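
The pairwise test described above can be sketched as follows. This is only a minimal illustration, not the algorithm of [Brady and Asada, 1984]: it assumes that each contour point comes with a unit outward normal, and it checks the mirror form of the local symmetry condition (the second normal must be the reflection of the first across the perpendicular bisector of the chord). The function names and the tolerance are hypothetical.

```python
import math
from itertools import combinations


def is_local_symmetry(p1, n1, p2, n2, tol=1e-6):
    """Return True if the contour points p1 and p2 form a local symmetry.

    p1, p2 are (x, y) points; n1, n2 are unit outward normals.
    Assumed test: n2 equals n1 mirrored across the perpendicular
    bisector of the chord p1p2, so that both normals make the same
    angle with the line joining the two points.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return False
    dx, dy = dx / length, dy / length
    dot = n1[0] * dx + n1[1] * dy
    # Mirror image of n1: flip its component along the chord direction.
    mx, my = n1[0] - 2.0 * dot * dx, n1[1] - 2.0 * dot * dy
    return math.hypot(n2[0] - mx, n2[1] - my) <= tol


def find_ribbon_pairs(points, normals, tol=1e-6):
    """Brute-force O(n^2) search over all pairs of contour points."""
    return [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if is_local_symmetry(points[i], normals[i], points[j], normals[j], tol)
    ]
```

On a circle every pair of boundary points passes the test, whereas two points on parallel lines joined by an oblique chord do not. Grouping the resulting pairs into ribbons is a separate step that this sketch does not attempt.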

If  we  can  find  such  a  “local  signature”  (i.e.,  char¬ 
acterize  ribbon  pairs  based  on  image  features  such  as 
angles,  distances,  and  curvatures,  which  are  measur¬ 
able  locally  in  images)  for  Brooks  ribbons,  then  we  can 
use  similar  algorithms  to  segment  images  containing 
Brooks  ribbons. 

Is  there  any  such  form  of  local  signature  for  Brooks 
ribbons?  Using  contour  curvature,  we  show  that  the 
answer  is  no  in  the  most  general  case,  but  yes  for  some 
classes  of  Brooks  ribbons.  Skewed  symmetries  are  one 
of  these  classes. 


Figure  3.  Brady  ribbons  are  defined  by  local  symmetries. 

The paper is organized as follows. In section 1, we
define more precisely Blum, Brooks, and Brady ribbons.
We then prove (section 2) that Blum ribbons are
not Brooks ribbons, and (section 3) that Brooks ribbons
are not Brady ribbons. In section 4, we calculate
the curvature of the contour of a Brooks ribbon, and
show that it can be used to characterize ribbon pairs
for some classes of Brooks ribbons. Finally, in section
5, we use these results to give a local characterization
of pairs of points which form skewed symmetries, and
sketch an algorithm for finding skewed symmetries in
an image.


1. Ribbons


Following Rosenfeld [Rosenfeld, 1986], we call ribbon
a plane shape generated by sweeping a geometric figure
along a spine. The size and the orientation of this figure
may change as it moves.

A ribbon pair is a pair of points on the boundary
of a ribbon which correspond to the same axis point.
In this paper, we will characterize ribbons in terms of
ribbon pairs. Notice that a given point may participate
in several ribbon pairs.

Blum ribbons ([Blum, 1967], [Blum and Nagel, 1978],
see Figure 1) are generated by a disc centered on the
spine. Correspondingly, Blum ribbon pairs are made
of points such that there exists a disc included in the
ribbon and tangent to its boundary at these two points.

Brooks ribbons ([Brooks, 1981], see Figure 2) are
generated by line segments centered on the spine such
that the angle between the line segments and the tangent
to the spine is constant. Brooks ribbon pairs correspond
to the end points of these segments.

Brooks ribbons are two-dimensional versions of generalized
cylinders [Binford, 1971]. Notice that Rosenfeld
[Rosenfeld, 1986] considers in his analysis the class
of Brooks ribbons for which the angle between the generators
and the spine is a right angle.

Brady  ribbons  ([Brady,  1984],  see  Figure  3)  are  gen¬ 
erated  by  line  segments  which  make  equal  angles  with 
the  sides  of  the  ribbon.  A  Brady  ribbon  pair  consists  of 
two  points  such  that  the  angles  between  their  outward 
normal  and  the  line  joining  the  points  are  the  same,  i.e. 
these  points  form  a  local  symmetry. 

It is not a priori evident that Brady ribbons have
a smooth spine; Giblin and Brassett have, however,
proved  that  it  is  in  general  true  [Giblin  and  Brassett, 
1985].  Conversely,  Brooks  ribbons  have  by  definition 
a  smooth  spine,  but  it  is  not  clear  what  the  ribbon 
associated  to  a  given  arbitrary  shape  is. 

In the sequel, we will suppose that the spine of a ribbon
is a smooth plane curve x(s) parameterized by its
arc length s. The corresponding unit tangent, normal,
and curvature will be denoted by t, n, κ.

We will use the fact that, from elementary differential
geometry [do Carmo, 1976], we have

$$\frac{d\mathbf{x}}{ds} = \mathbf{t}, \qquad \frac{d\mathbf{t}}{ds} = \kappa\,\mathbf{n}, \qquad \frac{d\mathbf{n}}{ds} = -\kappa\,\mathbf{t}.$$


Figure 4. This Blum ribbon is not a Brooks ribbon. The
axis drawn in this figure is the Brady/Brooks axis corresponding
to the envelope of the Blum ribbon. The angle
between the line generators and the axis is not constant.


2.  Blum  ribbons  are  not  Brooks  ribbons 

We now prove that Blum ribbons are not, in general,
Brooks ribbons. An immediate corollary of this result
is  that  Brady  ribbons  (at  least  those  which  may  also  be 
described  as  Blum  ribbons)  are  not,  in  general,  Brooks 
ribbons. 

To  prove  this  result,  we  consider  Blum  ribbons  whose 
spine  is  a  simple  smooth  curve.  The  boundary  of  a 
Blum  ribbon  is  given  by  its  envelope.  We  derive  an 
expression  for  the  envelope,  and  show  that  the  angle 
between  the  line  joining  two  points  on  this  envelope 
which  form  a  Blum  ribbon  pair  and  the  tangent  to  the 
spine  is  not  constant  (Figure  4).  This  proves  that  the 
ribbon  is  not  a  Brooks  ribbon. 

Suppose the Blum spine is given by a plane curve x(s)
parameterized by its arc length s, and let t, n, and
κ denote the corresponding tangent, normal, and curvature.
Consider now a function r(s) and the region
r(s, θ) defined by

$$\mathbf{r}(s, \theta) = \mathbf{x}(s) + r(s)(\cos\theta\,\mathbf{t} + \sin\theta\,\mathbf{n}).$$

This region is a Blum ribbon, and x(s) is its spine.
The boundary of this ribbon is given by its envelope
e(s), which is tangent to the curves r(s, θ) for every s
(see [Faux and Pratt, 1979]). Methods for characterizing
the envelope can be found in the literature. In
particular, we must have


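$$\frac{d\mathbf{e}}{ds} \times \frac{\partial \mathbf{r}}{\partial \theta} = 0,$$

where “×” denotes the operator which associates to
two vectors their determinant. Substituting

$$\frac{d\mathbf{e}}{ds} = \frac{d[\mathbf{r}(s, \theta(s))]}{ds} = \frac{d\theta}{ds} \frac{\partial \mathbf{r}}{\partial \theta} + \frac{\partial \mathbf{r}}{\partial s}$$

in the previous equation, we get

$$\frac{\partial \mathbf{r}}{\partial s} \times \frac{\partial \mathbf{r}}{\partial \theta} = 0.$$

We have

$$\frac{\partial \mathbf{r}}{\partial s} = (1 + r' \cos\theta - \kappa r \sin\theta)\,\mathbf{t} + (r' \sin\theta + \kappa r \cos\theta)\,\mathbf{n},$$

$$\frac{\partial \mathbf{r}}{\partial \theta} = r(-\sin\theta\,\mathbf{t} + \cos\theta\,\mathbf{n}).$$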

It follows that the envelope is given by the points
(s, θ) which verify

$$\cos\theta + r'(s) = 0.$$

This equation has solutions iff |r'(s)| < 1, and two solutions
in that case. The corresponding envelope points
are then x_ε(s), ε = ∓1, with

$$\mathbf{x}_\varepsilon(s) = \mathbf{x} + r\bigl(-r'\,\mathbf{t} + \varepsilon\sqrt{1 - r'^2}\,\mathbf{n}\bigr).$$
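
The envelope expression can be evaluated directly. The sketch below is only an illustration of the formula x_ε = x + r(−r' t + ε√(1 − r'²) n); the function name and the way the spine frame is passed in are assumptions, not part of the original text.

```python
import math


def blum_envelope_point(x, t, n, r, r_prime, eps):
    """Envelope point x_eps = x + r(-r' t + eps * sqrt(1 - r'^2) n).

    x: spine point (x, y); t, n: unit tangent and normal at that point;
    r, r_prime: disc radius and its derivative with respect to arc length;
    eps: +1 or -1, selecting one side of the ribbon.
    Returns None when |r'| > 1, where cos(theta) + r' = 0 has no solution.
    """
    if abs(r_prime) > 1.0:
        return None
    root = math.sqrt(1.0 - r_prime * r_prime)
    a = -r * r_prime        # component along the tangent
    b = eps * r * root      # component along the normal
    return (x[0] + a * t[0] + b * n[0], x[1] + a * t[1] + b * n[1])
```

With a constant radius (r' = 0) the two envelope points lie at distance r on either side of the spine along the normal. In every case the returned point stays at distance r from the spine point, as expected for a point of the generating disc's boundary.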

The pairs (x−1, x+1) define a Blum ribbon pair, and
therefore a Brady ribbon pair. We are now going to
compute the axis of the associated Brady ribbon, and
then prove that the pair is not a Brooks ribbon pair.
The axis of the Brady ribbon is given by

$$\mathbf{x}_0 = \tfrac{1}{2}(\mathbf{x}_{-1} + \mathbf{x}_{+1}) = \mathbf{x} - r r'\,\mathbf{t}.$$

Note that, as expected, the Blum and Brady axes
don't coincide, unless the Blum axis is straight or the
function r is constant. Let us now show that this ribbon
pair is not a Brooks ribbon pair. If it were one, its
Brooks axis and Brady axis would coincide (each of
them is the locus of the mid-points of the ribbon pair).

But the angle α between the tangent to the axis of
a Brooks ribbon and the line joining the points of any
pair of this ribbon is constant. Let us calculate this
angle. We first calculate the tangent to x0(s). We
have


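$$\frac{d\mathbf{x}_0}{ds} = (1 - r'^2 - r r'')\,\mathbf{t} - \kappa r r'\,\mathbf{n}.$$

So the unit tangent t0 to the Brady axis is

$$\mathbf{t}_0 = \frac{(1 - r'^2 - r r'')\,\mathbf{t} - \kappa r r'\,\mathbf{n}}{\sqrt{(1 - r'^2 - r r'')^2 + (\kappa r r')^2}}.$$


Figure 5. This Brooks ribbon is not a Brady ribbon: the
angles between a generator and its sides are not the same.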

The direction of the line joining x−1 to x+1 is n. We
have therefore

$$\cos\alpha = \frac{-\kappa r r'}{\sqrt{(1 - r'^2 - r r'')^2 + (\kappa r r')^2}}.$$

This angle is clearly non-constant in general. A sufficient
condition for it to be constant (and in that case
the angle is π/2) is that κ = 0 or r' = 0, as could be
expected. We have just proved that Blum ribbons and
Brady ribbons are, in general, not Brooks ribbons.
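
As a numerical illustration of the expression for cos α, the sketch below evaluates it along a hypothetical Blum ribbon whose spine is a unit circle (κ = 1) and whose radius is r(s) = 0.5 + 0.2 sin s. The example ribbon and the function names are assumptions chosen only to show that the angle varies with s.

```python
import math


def cos_axis_angle(kappa, r, r1, r2):
    """cos(alpha) = -kappa r r' / sqrt((1 - r'^2 - r r'')^2 + (kappa r r')^2).

    r1 and r2 stand for r' and r''. Returns None when the tangent to the
    mid-point axis vanishes, in which case the angle is undefined.
    """
    a = 1.0 - r1 * r1 - r * r2      # tangential component of dx0/ds
    b = -kappa * r * r1             # normal component of dx0/ds
    norm = math.hypot(a, b)
    if norm == 0.0:
        return None
    return b / norm


def sample(s):
    """Hypothetical example: unit-circle spine, r(s) = 0.5 + 0.2 sin(s)."""
    r = 0.5 + 0.2 * math.sin(s)
    r1 = 0.2 * math.cos(s)
    r2 = -0.2 * math.sin(s)
    return cos_axis_angle(1.0, r, r1, r2)
```

Evaluating `sample` at different values of s gives different cosines, whereas a straight spine (κ = 0) or a constant radius (r' = 0) always gives cos α = 0.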


3.  Brooks  ribbons  are  not  Brady  ribbons 

We  now  prove  that  Brooks  ribbons  are  not,  in  gen¬ 
eral,  Brady  ribbons.  An  immediate  corollary  of  this 
result  is  that  Brooks  ribbons  are  not,  in  general,  Blum 
ribbons. 

To  prove  this  result,  we  consider  a  Brooks  ribbon, 
and  derive  an  expression  for  points  on  its  boundary 
which  form  a  ribbon  pair.  We  prove  that  the  angles 
between  the  outward  normals  to  the  contour  at  these 
points  and  the  line  joining  the  points  are  not  the  same 
(Figure 5). The result follows immediately.

Consider an angle θ, a function r(s) and the curves
x_ε(s), ε = ∓1, defined by

$$\mathbf{x}_\varepsilon(s) = \mathbf{x}(s) + \varepsilon\, r(s)\bigl(\cos\theta\,\mathbf{t}(s) + \sin\theta\,\mathbf{n}(s)\bigr).$$

We have just defined a Brooks ribbon of spine x(s)
whose sides are given by x−1(s) and x+1(s). For a
given s, the pair (x−1, x+1) is a ribbon pair.


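Let s_ε and t_ε denote the associated arc length and
unit tangent. We have

$$\frac{d\mathbf{x}_\varepsilon}{ds} = [1 + \varepsilon(r' \cos\theta - \kappa r \sin\theta)]\,\mathbf{t} + \varepsilon(r' \sin\theta + \kappa r \cos\theta)\,\mathbf{n}.$$

Therefore,

$$\frac{ds_\varepsilon}{ds} = \left|\frac{d\mathbf{x}_\varepsilon}{ds}\right| = \sqrt{1 + 2\varepsilon(r' \cos\theta - \kappa r \sin\theta) + \kappa^2 r^2 + r'^2},$$

$$\mathbf{t}_\varepsilon = \frac{d\mathbf{x}_\varepsilon}{ds_\varepsilon} = \frac{ds}{ds_\varepsilon} \frac{d\mathbf{x}_\varepsilon}{ds}.$$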

The direction of the line joining x−1 and x+1 is c =
cos θ t + sin θ n. Let α_ε be the angle between c and t_ε;
we deduce from the above expression that

$$\cos\alpha_\varepsilon = \frac{\cos\theta + \varepsilon r'}{\sqrt{1 + 2\varepsilon(r' \cos\theta - \kappa r \sin\theta) + \kappa^2 r^2 + r'^2}},$$

$$\sin\alpha_\varepsilon = \frac{\sin\theta - \varepsilon \kappa r}{\sqrt{1 + 2\varepsilon(r' \cos\theta - \kappa r \sin\theta) + \kappa^2 r^2 + r'^2}}.$$

The points x−1 and x+1 form a Brady ribbon pair
(i.e., a local symmetry) iff α−1 = π − α+1, or equivalently
tan α−1 = − tan α+1. This condition can be
easily rewritten

$$\kappa r r' = -\sin\theta \cos\theta.$$


It follows that Brooks ribbons are not, in general,
Brady ribbons. In the case studied by Rosenfeld
[Rosenfeld, 1986], we have θ = π/2 (right ribbon), and
a Brooks ribbon is a Brady ribbon iff κ = 0 (straight
axis) or r' = 0 (constant cross-section). In the general
case, even Brooks ribbons with a straight axis or
a constant cross-section are not Brady ribbons.
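
The condition κ r r' = − sin θ cos θ and its equivalent form tan α−1 = − tan α+1 can be checked numerically. The sketch below is illustrative only; the function names and the tolerance are assumptions.

```python
import math


def tan_alpha(eps, kappa, r, r_prime, theta):
    """tan(alpha_eps) = (sin(theta) - eps kappa r) / (cos(theta) + eps r')."""
    return (math.sin(theta) - eps * kappa * r) / (math.cos(theta) + eps * r_prime)


def brooks_pair_is_local_symmetry(kappa, r, r_prime, theta, tol=1e-9):
    """True when kappa r r' = -sin(theta) cos(theta), up to the tolerance."""
    residual = kappa * r * r_prime + math.sin(theta) * math.cos(theta)
    return abs(residual) <= tol
```

For a right ribbon (θ = π/2) the test succeeds exactly when κ = 0 or r' = 0. For an oblique angle such as θ = π/4 it fails for a straight axis, in agreement with the remark that even straight or constant-width Brooks ribbons are not Brady ribbons in general.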


4.  Local  signature  of  Brooks  ribbons 

We  have  shown  that  Brooks  ribbons  are  not  Brady 
ribbons.  In  particular  they  don’t  form  local  symme¬ 
tries,  and  algorithms  like  those  used  to  find  smooth 
local  symmetries  [Brady  and  Asada,  1984]  cannot  be 
used  to  segment  images  containing  Brooks  ribbons. 

Is it possible to find another “local signature” characterizing
Brooks ribbon pairs based on quantities measurable
locally in images (i.e., distances, angles, curvatures)?
To answer this question, we now calculate the
curvature κ_ε of the contour of a Brooks ribbon.


Figure  6.  A  worm,  i.e.,  a  Brooks  ribbon  with  a  constant 
cross-section. 


It is well known that the curvature of a plane curve
y(t) (not necessarily parameterized by its arc length)
is given by

$$\kappa = \frac{\mathbf{y}' \times \mathbf{y}''}{|\mathbf{y}'|^3},$$

where “×” denotes the operator which associates to two
vectors their determinant (see, for example [do Carmo,
1976], p. 25).

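We have

$$\frac{d^2\mathbf{x}_\varepsilon}{ds^2} = \varepsilon\bigl((r'' - \kappa^2 r)\cos\theta - (\kappa' r + 2\kappa r')\sin\theta\bigr)\,\mathbf{t} + \bigl[\kappa + \varepsilon\bigl((r'' - \kappa^2 r)\sin\theta + (\kappa' r + 2\kappa r')\cos\theta\bigr)\bigr]\,\mathbf{n}.$$

It follows that

$$\kappa_\varepsilon\,|\mathbf{x}_\varepsilon'|^3 = \kappa\bigl(1 + 2\varepsilon(r' \cos\theta - \kappa r \sin\theta) + \kappa^2 r^2 + r'^2\bigr) + (\varepsilon\cos\theta + r')(\kappa' r + \kappa r') + (\varepsilon\sin\theta - \kappa r)\,r''.$$

As could be expected, the curvature κ_ε depends on
κ, κ', r, r', r'', and θ. On the other hand, we can measure
in the images the quantities α_ε, κ_ε (ε = ∓1), and r.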

We  have  therefore  six  unknowns  for  five  measures, 
so  it  is  not  possible,  in  general,  to  characterize  Brooks 
ribbon  pairs  based  only  on  local  information.  If  some  of 
the  Brooks  ribbon  parameters  are  fixed,  however,  this 
characterization  becomes  possible. 

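We now consider such an example, the case of a worm
(Figure 6), i.e., a Brooks ribbon with a constant cross-section
(r' = r'' = 0). We now have four unknowns
and five measures, so we can characterize ribbon pairs.
The expressions of cos α_ε, tan α_ε, and κ_ε simplify into

$$\cos\alpha_\varepsilon = \frac{\cos\theta}{\sqrt{1 - 2\varepsilon\kappa r \sin\theta + \kappa^2 r^2}},$$

$$\tan\alpha_\varepsilon = \frac{\sin\theta - \varepsilon\kappa r}{\cos\theta},$$

$$\kappa_\varepsilon = \frac{\kappa(1 - 2\varepsilon\kappa r \sin\theta + \kappa^2 r^2) + \varepsilon\kappa' r \cos\theta}{(1 - 2\varepsilon\kappa r \sin\theta + \kappa^2 r^2)^{3/2}}.$$


Figure 7. A skewed symmetry, i.e., a Brooks ribbon with
a straight spine.


It follows that

$$\frac{\kappa_{+1} + \dfrac{\cos\alpha_{+1}}{2r}(\tan\alpha_{+1} - \tan\alpha_{-1})}{\kappa_{-1} + \dfrac{\cos\alpha_{-1}}{2r}(\tan\alpha_{+1} - \tan\alpha_{-1})} = -\left[\frac{\cos\alpha_{+1}}{\cos\alpha_{-1}}\right]^3.$$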

The  above  expression  is  a  necessary  local  condition 
that  a  pair  of  points  must  verify  to  form  a  worm  ribbon 
pair. 


5.  Skewed  symmetries 

We  finally  consider  a  simple  (but  important)  class  of 
Brooks  ribbons:  skewed  symmetries  (Figure  7).  It  has 
been  shown  [Kanade,  1981],  that  skewed  symmetries 
can  be  used  to  recover  the  three-dimensional  shape  of 
objects  with  planar  faces  and  bilateral  symmetries  from 
their  image  contours. 

Unfortunately,  it  is  difficult  to  find  skewed  symme¬ 
tries  in  an  image  (however,  see  [Friedberg,  1986],  for  a 
method  based  on  moments).  In  this  section,  we  prove 
that  skewed  symmetries  have  a  local  signature,  and  pro¬ 
pose  an  algorithm  for  finding  them. 

A skewed symmetry is a Brooks ribbon with a
straight axis. In that case, κ = 0 and κ' = 0. So,
again, we have four unknowns for five measures. In
particular, we have

$$\sin\alpha_\varepsilon = \frac{\sin\theta}{\sqrt{1 + 2\varepsilon r' \cos\theta + r'^2}}$$

and

$$\kappa_\varepsilon = \frac{\varepsilon r'' \sin\theta}{(1 + 2\varepsilon r' \cos\theta + r'^2)^{3/2}}.$$

It follows immediately that

$$\frac{\kappa_{+1}}{\kappa_{-1}} = -\left[\frac{\sin\alpha_{+1}}{\sin\alpha_{-1}}\right]^3.$$

This  property  is  a  necessary  condition,  based  on  lo¬ 
cal  image  measures,  that  two  points  must  verify  to 
form  a  skewed  symmetry.  It  can  therefore  be  used  to 
find  skewed  symmetries  in  an  image.  A  simple  algo¬ 
rithm  would  be  to  test  every  possible  pair  of  contour 
points,  check  whether  they  verify  the  above  condition, 
and group the ribbon pairs into ribbons. This is analogous
to the O(n²) algorithm for finding smooth local
symmetries (see [Brady and Asada, 1984]).
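
The local condition can be written without division as a residual that should vanish for a skewed symmetry pair. The sketch below is an illustration under stated assumptions: both sides are taken to be oriented in the direction of increasing s, sin α is measured with the sign convention of the formulas above, and the helper `side_measures` (a hypothetical name) simply evaluates the closed-form expressions of the text so that the residual can be checked on synthetic data.

```python
import math


def skewed_symmetry_residual(k_plus, sin_a_plus, k_minus, sin_a_minus):
    """Residual of k+ / k- = -(sin a+ / sin a-)^3, written without division.

    Returns k+ * sin(a-)^3 + k- * sin(a+)^3, which is zero when the two
    points satisfy the necessary condition for a skewed symmetry pair.
    """
    return k_plus * sin_a_minus ** 3 + k_minus * sin_a_plus ** 3


def side_measures(eps, r1, r2, theta):
    """Closed-form (sin(alpha_eps), kappa_eps) for a skewed symmetry.

    r1 and r2 stand for r' and r''; eps is +1 or -1.
    """
    norm = math.sqrt(1.0 + 2.0 * eps * r1 * math.cos(theta) + r1 * r1)
    sin_alpha = math.sin(theta) / norm
    kappa = eps * r2 * math.sin(theta) / norm ** 3
    return sin_alpha, kappa
```

In a brute-force search one would evaluate this residual for every pair of contour points and keep the pairs for which its magnitude falls below a threshold chosen for the noise level of the measured curvatures and angles.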

Instead,  we  propose  to  use  the  method  of  projec¬ 
tions  [Nevatia  and  Binford,  1977].  This  method  can  be 
summarized  as  follows:  discretize  the  possible  orienta¬ 
tions  of  local  ribbon  axes;  for  each  of  these  directions, 
project  all  contour  points  into  buckets,  and  verify  the 
above  condition  for  points  within  the  same  bucket  only; 
group  the  resulting  ribbon  pairs  into  ribbons. 
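
The bucketing step of the method of projections can be sketched as follows. This is a rough illustration and not the implementation of [Nevatia and Binford, 1977] or [Sumanaweera et al., 1988]: it assumes that, for each discretized orientation, a point is assigned to a bucket according to its coordinate along that direction, and it only returns candidate pairs, which would still have to be checked against the local condition. The function name, the bucket width, and the use of a set are assumptions.

```python
import math
from collections import defaultdict


def candidate_pairs_by_projection(points, num_orientations, bucket_width):
    """Return index pairs (i, j), i < j, of points sharing a bucket.

    For each of num_orientations directions in [0, pi), every point is
    projected onto that direction and placed in the bucket given by the
    integer part of its projected coordinate divided by bucket_width.
    Only points in the same bucket are paired.
    """
    pairs = set()
    for k in range(num_orientations):
        phi = math.pi * k / num_orientations
        ux, uy = math.cos(phi), math.sin(phi)
        buckets = defaultdict(list)
        for idx, (px, py) in enumerate(points):
            key = math.floor((px * ux + py * uy) / bucket_width)
            buckets[key].append(idx)
        # Indices are appended in increasing order, so a < b implies i < j.
        for members in buckets.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    pairs.add((members[a], members[b]))
    return pairs
```

The cost is linear in the number of points for each orientation only as long as each bucket holds a bounded number of points, which is the situation the complexity estimate in the text has in mind.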

This method has been implemented for finding Brady
ribbons in [Sumanaweera et al., 1988]. Its advantages
are its complexity (O(kn), where k is the number
of discretized orientations and n the number of edge
points), and an easy grouping of ribbon pairs into ribbons
(the neighborhood information is preserved during
projection). We are currently working on extending
this method to skewed symmetries.
ABSTRACT

This  paper  describes  a  method  of  detecting  thin  cur¬ 
vilinear  features  in  an  image  based  on  a  detailed  analysis 
of  the  local  gray  level  patterns  at  each  pixel.  This  allows 
operations  such  as  thinning  and  gap  filling  to  be  based  on 
more  accurate  information. 


Many  types  of  images  contain  thin  curvilinear 
features;  for  example,  aerial  photographs  contain 
numerous  features  such  as  roads  and  streams.  This  paper 
describes  a  method  of  detecting  thin  curvilinear  features 
in  an  image  based  on  a  detailed  analysis  of  the  local  gray 
level  patterns  at  each  pixel.  As  we  shall  see,  this  allows 
operations  such  as  thinning  and  gap  filling  to  be  based  on 
more  accurate  information. 

The traditional approach [1] to extracting thin curvilinear
features from an image is usually along the following
lines:

a)  Apply  local  line  detectors  to  the  image. 

b)  Link  the  detected  line  pixels  into  connected  com¬ 
ponents. 

c)  Thin  the  components  using  standard  binary-image 
thinning  techniques. 

d)  Fill  gaps  in  the  components  using  “good  continua¬ 
tion’’  interpolation. 

A serious weakness of this approach is that at step
(a) it decides, at each pixel, whether a line passes through
it, and if so, in what direction, based on the absolute and
relative strengths of a set of line detector responses. This
decision is made independently for each pixel.
Thereafter, in steps (b-d), the thin region is treated as a
set of pixels; the local evidence which led to the selection
of those pixels is discarded. (A somewhat more flexible
approach estimates probabilities of lines in various directions
at each pixel, and iteratively adjusts these estimates
based on the estimates at neighboring pixels. This
The  support  of  the  Defense  Mapping  Agency  under  Contract 
DMA-85 -C  -000?  is  gratefully  acknowledged,  as  is  the  help  of  Sandy  German 
in preparing this paper.


“relaxation”  approach  allows  the  neighbors  of  the  pixel  to 
influence  the  decision,  but  it  loses  information  by  combin¬ 
ing  numerical  measures  of  confidence,  and  in  any  case  it 
still  does  make  a  final  decision  and  then  treats  the  thin 
region  as  a  set  of  pixels.) 

Our  approach  tries  to  avoid  making  decisions  at  the 
individual  pixels,  even  with  the  help  of  their  neighbors. 
Instead,  it  examines  the  neighborhood  of  each  pixel  and 
identifies  all  local  patterns  of  gray  levels  in  that  neighbor¬ 
hood  that  are  consistent  with  the  presence  of  a  line. 
(Since  the  line  may  be  more  than  one  pixel  thick,  these 
include  all  the  patterns  that  are  consistent  with  the  pres¬ 
ence  of  an  edge.)  These  patterns  are  found  by  examining 
all  possible  thresholdings  of  the  3X3  neighborhood  and 
selecting  those  of  the  resulting  binary  patterns  (“masks”) 
that  are  line-  or  edge-like.  These  patterns  are  shown  in 
Figure  1. 
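
Enumerating the thresholdings of a 3×3 neighborhood can be sketched as below. The sketch only produces the distinct non-constant binary patterns; deciding which of them are line- or edge-like requires the pattern table of Figure 1, which is not reproduced here. The function name and the mask representation (nested tuples of 0 and 1) are assumptions.

```python
def threshold_masks(neighborhood):
    """Return the distinct non-constant binary masks of a 3x3 neighborhood.

    neighborhood: three rows of three gray levels. One threshold is placed
    between each pair of consecutive distinct gray levels, so all-0 and
    all-1 masks are never produced and no mask is repeated.
    """
    levels = sorted({g for row in neighborhood for g in row})
    masks = []
    for low, high in zip(levels, levels[1:]):
        t = (low + high) / 2.0
        mask = tuple(tuple(1 if g > t else 0 for g in row) for row in neighborhood)
        masks.append(mask)
    return masks
```

A neighborhood with nine distinct gray levels yields eight masks, and a constant neighborhood yields none, so the number of masks to examine per pixel is always small.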

Each  pixel  now  has  associated  with  it  a  set  (possibly 
empty)  of  line-  or  edge-like  masks.  Note  that  there  can 
be  more  than  one  of  these  at  a  given  pixel,  since  two 
different  thresholds  can  both  yield  edge-  (or  line-)like 
binary  patterns,  as  we  see  if  we  apply  the  thresholds  1 
and  3  to  the  3X3  neighborhood 


However, there cannot be very many of them; if the gray
levels in the neighborhood are z1, ..., z9, where
0 = z0 < z1 < z2 < ... < z9 < z10 = 255, the patterns
produced by any threshold between z_i and z_{i+1} are all
the same, and some of these (e.g., for i = 0 or 9) are not
line- or edge-like.

Now  that  we  have  associated  masks  with  the  pixels, 
we  can  use  the  mask  information  in  performing  steps 
analogous  to  (b-d)  rather  than  simply  working  with  sets 
of  pixels.  As  we  shall  now  see,  this  leads  to  improved 
results. 
We begin by examining the masks at pairs of adjacent
pixels, and checking pairs of these masks for compatibility
with the presence of a line (or edge). Figure 2
shows sets of pairs that are compatible with the presence
of a thin line (the “extend” and “maybe-extend” pairs)
or with a line two pixels thick (the “broaden” and
“maybe-broaden” pairs). Figure 3 shows sets of pairs
that are compatible with the presence of edges in various
orientations.

By  taking  the  transitive  closure  of  mask  compatibil¬ 
ity,  we  obtain  sets  of  masks  that  represent  connected  line 
(or  edge)  fragments.  These  are  analogous  to  the  con¬ 
nected  components  of  step  (b),  but  they  are  now  com¬ 
ponents  of  masks,  not  of  pixels,  and  a  pixel  may  belong 
to  more  than  one  component. 
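
Taking the transitive closure of a pairwise compatibility relation amounts to computing connected components, for which a union-find structure is one common choice. The sketch below is an assumption about how this step could be coded, not the authors' implementation: masks are identified by integer indices, and `compatible_pairs` is a hypothetical list of index pairs found compatible.

```python
def mask_components(num_masks, compatible_pairs):
    """Group masks into components under the transitive closure of compatibility.

    num_masks: number of masks, identified by indices 0 .. num_masks - 1.
    compatible_pairs: iterable of (a, b) index pairs judged compatible.
    Returns a sorted list of components, each a sorted list of indices.
    """
    parent = list(range(num_masks))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in compatible_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(num_masks):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Because the elements are masks rather than pixels, a pixel that carries several masks can appear in several components, as noted in the text.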

We can now illustrate the advantage of keeping all
the masks at each pixel rather than just the “strongest”
one. (The examples used in this and the following paragraphs
are all fragments of roads on an aerial photograph;
they are all taken from Figure 4.) Figure 5a shows
an  8  X  8  portion  of  the  image;  the  numbers  in  the  centers 
of  the  squares  are  the  pixel  gray  levels.  In  each  square  we 
show  the  masks  that  describe  the  3X3  neighborhood  of 
that  pixel;  the  number  under  each  mask  is  a  measure  of 
its  contrast.  The  lines  joining  the  masks  represent  com¬ 
patibilities  (solid  lines  for  “extend”  compatibilities; 
dashed  lines  for  “broaden”  compatibilities).  We  see  that 
this  piece  of  image  contains  a  branching  thin  region.  If 
we  keep  only  the  highest-contrast  mask  at  each  pixel,  we 
obtain  Figure  5b,  in  which  nearly  all  the  compatibilities 
are  gone!  Things  are  somewhat  better,  but  still  far  from 
perfect,  if  we  keep  the  two  highest-contrast  masks  at 


each  pixel  (Figure  5c). 

We next illustrate the use of the masks in “thinning”,
i.e., in analyzing the mask patterns to determine
which pixels lie on or closest to the midline of the curvilinear
feature. In Figure 6, (a) shows a two pixel thick
linear  feature;  (c-d)  show  standard  thinning  results 
(depending  on  which  side  we  thin  from  first);  and  (b) 
shows  the  result  when  the  masks  are  used.  The  details  of 
how  this  result  was  obtained  are  shown  in  Figure  7a, 
where  the  numbers  under  the  masks  are  now  component 
numbers; component 949 is the one shown in Figure 6.
The  masks  show  that  the  pixels  in  column  93  better 
represent  the  midline  in  rows  70-73,  while  those  in 
column  92  better  represent  it  in  rows  74-77.  In  Figure  7b 
the  masks  not  on  the  midline  have  been  dropped  (though 
their  links  to  other  masks  have  been  kept). 

Finally,  we  illustrate  the  use  of  the  masks  to  fill 
small  gaps.  Figure  8  shows  the  midlines  of  three  linear 
feature  fragments,  two  denoted  by  black  squares  and  one 
by  white  circles.  If  we  regarded  the  fragments  as  sets  of 
pixels,  and  filled  the  gap  on  geometric  grounds,  we  would 
join  the  white  fragment  to  the  lower  black  one.  Inspec¬ 
tion  of  the  masks,  however,  which  are  shown  (for  the 
boxed  center  part  of  Figure  8)  in  Figure  24,  indicates  that 
we  should  in  fact  join  the  white  fragment  (No.  240)  to 
the  upper  black  one  (No.  647),  not  to  the  lower  one  (No. 
626). 

A  more  detailed  discussion  of  the  mask-based 
approach  to  curvilinear  feature  detection  can  be  found  in 
[2-4],  where  additional  examples  are  also  given.  We 
believe  that  the  mask-based  approach  can  be  useful  in 
the precise delineation of thin linear features. It may also
be of value in precisely delineating the edges of thick
objects;  here  we  would  use  only  the  edge-like  masks. 

It  should  be  pointed  out  that  the  mask-based 
approach  to  extracting  edge  or  curve  fragments  is  likely 
to  be  more  computationally  costly  than  conventional 
approaches.  However,  when  massively  parallel  SIMD  sys¬ 
tems  are  available  at  low  cost,  which  can  be  expected  to 
happen  during  the  next  decade,  the  extra  computational 
cost  of  detailed  pixel-neighborhood  analysis  will  no  longer 
be  a  major  factor.  We  have  therefore  concentrated  on 
demonstrating  the  advantages  of  such  analysis,  in  terms 
of the accuracy of the resulting linear feature delineations,
on the assumption that in the relatively near
future,  it  will  no  longer  be  necessary  to  rule  out  this 
approach on grounds of cost.

Figure 2. Consistent pairs of local edge or line patterns.

Figure 3. Consistent pairs of local edge patterns.

Abstract

Segmentation  is  an  important  step  in  scene  analy¬ 
sis.  Segmentation  algorithms  based  on  geometry  alone 
are  easily  confused  by  image  artefacts  due  to  physi¬ 
cal  processes,  for  example,  specularities.  On  the  other 
hand,  segmentation  algorithms  based  on  purely  local 
properties  are  rarely  adequate  for  segmenting  complex 
images.  In  this  paper,  we  describe  segmentation  algo¬ 
rithms  which  use  both  geometric  and  physical  proper¬ 
ties.  We  discuss  the  performance  of  these  algorithms. 

Introduction 

Segmentation  is  a  fundamental  first  step  in  scene 
analysis.  Good  edge  detectors,  such  as  [Canny,  1986], 
and  [Nalwa,  1986],  have  recently  become  available. 
However,  there  is  still  a  lack  of  robust  algorithms  to 
identify  geometrically  and  physically  meaningful  re¬ 
gions  and  contours  in  images  which  are  actual  projec¬ 
tions  of  three-dimensional  surface  patches  and  edges. 

In  this  paper,  we  describe  several  implemented  seg¬ 
mentation  algorithms  which  use  either  geometric  or 
physical  constraints.  This  paper  is  organized  into  six 
sections.  Sections  1,  2,  and  3  describe  our  geomet¬ 
ric  segmentation  algorithms.  These  algorithms  rely  on 
image  symmetries  and  graphs  of  regions,  edges,  and 
edge junctions to find meaningful regions in the image.
Sections 4, 5, and 6 describe our physical segmentation
algorithms. Section 4 provides an overview. Section 5
presents an algorithm which uses physical constraints
to link edges. Section 6 discusses the problem
of  constructing  a  scene  description  in  terms  of  surface 
patches,  their  properties,  and  their  relationships. 

*Support for this work was provided in part by the Air Force
Office of Scientific Research under contract F33615-85-C-5106
and by the Advanced Research Projects Agency of the Department
of Defense under Knowledge Based Vision ADS subcontract
S1093 S-1.


Figure  1.  The  test  image:  a  Smith  valve 

1.  Geometric  segmentation 

In the next three sections, we concentrate on geometric
segmentation using contour information. We assume
that edges have been found and linked using some
edge detection scheme (e.g., [Canny, 1986], [Nalwa,
1986], [Nalwa and Pauchon, 1987]). We do not assume,
however, perfect linking, closed contours, or perfect
junctions. On the contrary, we assume that edges may
be wrongly linked, that gaps and edge terminations exist,
and that junctions may be missed.

To use this imperfect boundary information, it is
necessary to characterize relations between edge fragments.
We consider in this paper two types of such relations,
oppositeness and curvilinearity. Oppositeness
relates different edges which bound a given region. We
give in section 2 an algorithm to find ribbons, which
correspond to an oppositeness relation. Curvilinearity


Figure 2. The edges found in the valve image using
Canny's [Canny, 1986] edge detector.

relates edges which satisfy a geometric continuity condition;
it extends collinearity. In section 3, we give an
algorithm, based on curvilinearity, for building a graph
structure of edges separated by junctions, and finding
cycles in this graph.

Both algorithms have been implemented on a Symbolics
Lisp machine. They are demonstrated on a
256 x 256 image of a Smith valve, shown in Figure 1.
We have used an implementation of Canny's edge detector
[Canny, 1986] to provide the initial edge fragments.
Figure 2 shows the edges output by Canny’s
operator.


2.  Segmentation  into  ribbons 

In this section, we describe an algorithm for finding
ribbons in an image from contour data. This algorithm
is based on Nevatia and Binford's method of projections
[Nevatia and Binford, 1977], but offers several
significant improvements. In particular, our algorithm
deals with imperfect contours and different types of ribbons.

We first define ribbons (section 2.1), summarize the
original Nevatia and Binford algorithm, and compare
it to different methods (section 2.2). We then detail
our algorithm (section 2.3), and illustrate and discuss
its performance on some examples (section 2.4). Finally
(section 2.5), we discuss our plans for future work.
We give rough estimates of the time complexity of the
different steps of our algorithm throughout.


2-1.  Ribbons 

In  this  paper,  a  ribbon  is  the  plane  shape  generated 
by  sweeping  a  line  segment,  called  a  generator,  or  cross- 
section,  along  a  plane  curve,  called  an  axis,  or  spine. 
The  oppositeness  relation  used  here  is  the  relation  be¬ 
tween  the  sides  of  a  ribbon. 

In  [Rosenfeld,  1986],  Rosenfeld  discusses  more  gen¬ 
eral  classes  of  ribbons,  where  the  generator  can  be  an 
arbitrary  plane  shape,  and  calls  the  class  of  ribbons 
considered  here  L-ribbons  (for  line  ribbons).  See  also 
[Ponce,  1988]  for  an  extension  of  Rosenfeld’s  results. 

The length of the generators, as well as their orientation
with respect to the spine, may vary. Two
important classes of ribbons are the so-called Brooks
and Brady ribbons, following Rosenfeld's terminology
[Rosenfeld, 1986].

Brooks  ribbons  are  ribbons  such  that  the  angle  be¬ 
tween  the  generators  and  the  tangent  to  the  spine  is 
constant.  They  are  two-dimensional  generalized  cylin¬ 
ders  [Binford,  1971].  In  Acronym,  Brooks  [1981]  re¬ 
stricts  himself  mainly  to  ribbons  with  a  straight  spine. 

Brady  ribbons  (also  called  smooth  local  symmetries) 
[Brady  and  Asada,  1984]  are  ribbons  such  that  any  gen¬ 
erator  makes  equal  angles  with  the  sides  of  the  ribbon, 
i.e.,  the  ends  of  the  generators  are  locally  symmetric 
with  respect  to  their  mid-points. 

2-2.  The  method  of  projections 

Assume for the moment that it is possible to decide
whether two contour points form a ribbon pair (i.e., are
the end-points of a ribbon generator), based on some
image measure. This can clearly be done for Brady
ribbons (local symmetries are characterized by the fact
that the angles between a generator and the outward
normals at its two ends are equal).

We  will  come  back  later  to  this  problem  for  other 
types  of  ribbons.  Under  the  above  assumption,  ribbons 
can  be  found  in  an  image  by  first  finding  ribbon  pairs, 
and  then  grouping  ribbon  pairs  into  ribbons.  The  axis 
of  a  ribbon  is  given  by  the  mid-points  of  its  generators. 

The first part of this algorithm, finding ribbon pairs,
can be implemented in different ways, some more efficient
than others. The simplest implementation
is to test all possible pairs of points. This is the algorithm
used in Brady and Asada [1984] to find Brady
ribbons. Its complexity is O(n^2), where n is the number
of contour points (Brady and Asada also propose
a much faster algorithm, based on analytical solutions
for local symmetries between circular arcs, but it does
not generalize to other types of ribbons).

Another implemented algorithm is the method of
projections, first proposed by Nevatia and Binford
[1977]. It can be summarized as follows: discretize the
possible orientations of local ribbon axes; for each of
these directions, project all contour points into buckets,
and verify that pairs of points within the same
bucket form ribbon pairs by testing their angular bisector.
The complexity of this method is O(k.n), where
k is the number of discretized orientations. For large
numbers of contour points, the second method is clearly
more efficient.

Figure 3. Some ribbons found by the algorithm of section
2.3.

2-3.  Algorithm 

The  original  method  of  projections  assumes  closed 
contours,  and  works  only  for  the  class  of  ribbons  con¬ 
sidered  by  Nevatia  and  Binford.  To  extend  this  method 
to  deal  with  different  kinds  of  ribbons,  and  imperfect 
contour  data,  we  use  the  following  algorithm,  based  on 
a  cost  function  that  is  zero  if  a  pair  of  points  is  a  perfect 
ribbon  pair,  and  increases  as  the  pair  of  points  becomes 
a  poorer  approximation  to  a  ribbon  pair. 

The  algorithm  can  be  decomposed  into  three  steps  as 
follows: 

1.  Projecting  points.  For  each  discretized  pro¬ 
jection  direction,  do:  project  all  contour  points 
in  buckets;  compare  pairs  of  points  within  each 
bucket,  and  assign  a  cost  to  each  of  these  pairs; 
output  each  pair  whose  cost  is  below  a  given 
threshold. 

2.  Sorting  ribbon  pairs.  Initialize  a  current  list  of 
ribbon  pairs  to  nil;  for  every  ribbon  pair  output 
by  the  previous  step,  add  the  pair  to  the  current 
list  of  ribbon  pairs,  keeping  it  sorted  according  to 
the  cost  function. 

3. Building ribbons. Repeat the following until the
list is empty: pop the list; the corresponding pair
of points is a new ribbon pair; if it is connected
to an existing ribbon, add the pair to the ribbon,
otherwise, create a new ribbon; deactivate all incompatible
ribbon pairs.

Some details are missing in the previous algorithm.
The first step of the algorithm is relatively straightforward,
although it is important to ensure that buckets
are large enough to avoid missing some ribbon pairs
(i.e., buckets must overlap). It should be noted that
in the original Nevatia and Binford algorithm, the complexity
of this step is O(k.n), because each edge point
belongs to at most one ribbon pair per discretized orientation.
Here, as we don’t have closed contours, a
given point may form ribbon pairs with any other point
in its bucket which belongs to another contour fragment.
Thus, the complexity is roughly multiplied by
the number e of edges and becomes O(e.k.n), which
is still better than O(n^2), especially in high resolution
images.
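
The projection step for a single direction can be sketched as follows. This is a minimal illustration, not the implementation described in the text: the point format `(x, y, fragment_id)`, the function name `bucket_pairs`, and the use of non-overlapping buckets are assumptions made for brevity (the text requires overlapping buckets), and the cost test on each candidate pair is left out.

```python
import math
from collections import defaultdict
from itertools import combinations


def bucket_pairs(points, theta, bucket_width):
    """Candidate ribbon pairs for one projection direction.

    points: list of (x, y, fragment_id); theta: assumed axis orientation.
    Points whose coordinates along the axis fall in the same bucket, and
    which lie on different contour fragments, are returned as index pairs.
    """
    ax, ay = math.cos(theta), math.sin(theta)
    buckets = defaultdict(list)
    for i, (x, y, _) in enumerate(points):
        buckets[math.floor((x * ax + y * ay) / bucket_width)].append(i)
    pairs = []
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if points[i][2] != points[j][2]:
                pairs.append((i, j))
    return sorted(pairs)


# Two parallel contour fragments, four points each.
points = ([(x, 0.0, "lower") for x in range(4)]
          + [(x, 2.0, "upper") for x in range(4)])
pairs = bucket_pairs(points, theta=0.0, bucket_width=1.0)
```

Repeating this for each of the k discretized directions, and scoring each candidate with the cost function, would give the list of pairs passed on to the sorting step.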

The rest of the algorithm is independent of the
method of projections, and could be used as a post-processing
stage to the Brady and Asada [1984] algorithm
as well. In the second and third steps of the
algorithm, it is important to represent efficiently the
current list of ribbon pairs, so that insertion and deletion
of a given element, as well as popping the list, can
be done efficiently. We have chosen to represent this
list by an array of buckets, indexed by the cost function.
Each bucket itself is a conventional doubly linked
list, sorted according to the cost function.

This data structure allows insertion of new pairs in
almost constant time (finding the right bucket is done
in constant time; insertion in the doubly linked list corresponding
to the bucket depends on the list's length,
which is in general bounded). Deletion is done in constant
time by using the doubly linked lists. Popping
the list is also done in constant time. This implies that
the complexity of steps 2 and 3 is proportional to the
number of ribbon pairs generated, so it is also roughly
O(e.k.n). This justifies again the use of the method of
projections.
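
A data structure of this kind might look like the sketch below. The class and method names are hypothetical, costs are assumed to lie in [0, max_cost], and `pop` scans the bucket array from the start for simplicity; a cursor on the first non-empty bucket would be needed to approach the constant-time pop described above.

```python
class Node:
    def __init__(self, cost, item):
        self.cost, self.item = cost, item
        self.prev = self.next = None
        self.active = True


class BucketList:
    """Array of cost-indexed buckets; each bucket is a sorted doubly linked list."""

    def __init__(self, max_cost, num_buckets):
        self.width = max_cost / num_buckets
        self.heads = [None] * num_buckets

    def _index(self, cost):
        return min(int(cost / self.width), len(self.heads) - 1)

    def insert(self, cost, item):
        node = Node(cost, item)
        b = self._index(cost)
        prev, cur = None, self.heads[b]
        while cur is not None and cur.cost <= cost:  # short walk inside one bucket
            prev, cur = cur, cur.next
        node.prev, node.next = prev, cur
        if prev is None:
            self.heads[b] = node
        else:
            prev.next = node
        if cur is not None:
            cur.prev = node
        return node

    def delete(self, node):
        """Unlink a node in constant time using its prev/next pointers."""
        if not node.active:
            return
        node.active = False
        if node.prev is None:
            self.heads[self._index(node.cost)] = node.next
        else:
            node.prev.next = node.next
        if node.next is not None:
            node.next.prev = node.prev
        node.prev = node.next = None

    def pop(self):
        """Remove and return the lowest-cost (cost, item), or None if empty."""
        for head in self.heads:
            if head is not None:
                self.delete(head)
                return head.cost, head.item
        return None


queue = BucketList(max_cost=1.0, num_buckets=10)
a = queue.insert(0.42, "pair A")
b = queue.insert(0.05, "pair B")
c = queue.insert(0.07, "pair C")
queue.delete(c)          # e.g. pair C was found incompatible with a popped pair
first = queue.pop()
second = queue.pop()
third = queue.pop()
```

Keeping a handle to each inserted node is what makes deactivating an incompatible pair a constant-time operation.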

Two ribbon pairs are neighbors iff corresponding
points in the pairs are neighbors. For each edgel, we
build a list of adjacent ribbon pairs during the projection
step. In particular, when a ribbon pair rp is
popped from the current list, the incompatible ribbon
pairs are the pairs $rp_i$ both of whose ends are in a given
neighborhood of the ends of rp, and which are such that rp
and $rp_i$ cross each other. Finding the ribbon pairs
incompatible with a given ribbon pair is therefore straightforward.
Our algorithm solves the problem of choosing between
local ribbon descriptions by ensuring that the “best”
ribbon pairs are built first, forming seeds for larger ribbons.
Notice however that a given region can be described
by several ribbons, reflecting the different symmetries
of that region.

2-4.  Results 

The algorithm has been implemented for Brady ribbons.
In this case, the choice of a cost function is
straightforward. For example, let $\alpha_i$ be the angle between
the line joining the end-points $p_i$ (i = 1, 2) of
a ribbon pair and the outward normal at $p_i$; the cost
function can then be defined as $|\alpha_2 - \alpha_1|$. We have added a
last step to the segmentation algorithm. The previous
algorithm assumes that ribbons are made of connected
regions of ribbon pairs. In real images, there may be
gaps between ribbon pairs. Thus, after all elementary
ribbons have been found, one more grouping step is
performed to merge elementary ribbons into larger ribbons,
based on a distance measure between the ribbon
ends (the ends of a ribbon are ribbon pairs).
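
The cost function for Brady ribbons can be sketched as below. One interpretive assumption is made and should be checked against the intended definition: each angle is taken as the unsigned angle, in [0, pi/2], between the undirected joining line and the normal. The function names are hypothetical.

```python
import math


def line_normal_angle(p, q, normal):
    """Unsigned angle in [0, pi/2] between the line through p, q and a normal."""
    ux, uy = q[0] - p[0], q[1] - p[1]
    norm = math.hypot(ux, uy) * math.hypot(normal[0], normal[1])
    cosine = abs(ux * normal[0] + uy * normal[1]) / norm
    return math.acos(min(1.0, cosine))


def brady_cost(p1, n1, p2, n2):
    """|alpha_2 - alpha_1|: zero when the pair is locally symmetric."""
    return abs(line_normal_angle(p1, p2, n2) - line_normal_angle(p1, p2, n1))


c, s = math.cos(math.radians(30)), math.sin(math.radians(30))
# Mirror-symmetric normals about the vertical axis: a perfect ribbon pair.
symmetric = brady_cost((-1.0, 0.0), (-c, s), (1.0, 0.0), (c, s))
# Second normal perpendicular to the joining line: a poor ribbon pair.
asymmetric = brady_cost((-1.0, 0.0), (-c, s), (1.0, 0.0), (0.0, 1.0))
```

The symmetric pair has zero cost, while the second pair has a cost of about pi/3 and would be rejected by any reasonably tight threshold.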

Some of the results of segmentation on the test image
are shown in Figure 3. Facets of the valve and its
handle have been segmented by this technique. The
algorithm found most regions which a human would
identify as being symmetric. There were a few false
positives. The whole program took 1.5 hours on a Symbolics
LISP machine. 75% of this time was spent finding
ribbon pairs, discarding inconsistent ribbon pairs, and
grouping them into ribbons. Only 25% of the time was
spent in projecting contour points. The situation would
be inverted if the O(n^2) algorithm had been used, resulting
in a much longer computing time.

2-5.  Future  work 

In the implemented algorithm, we have only considered
Brady ribbons. The algorithm clearly generalizes
to any other type of ribbons such that some cost function
can be assigned to any potential ribbon pair based
on local measures. In their original algorithm, Nevatia
and Binford [1977] do not use a cost function, but
consider pairs of adjacent ribbon pairs, and reject the
pairs such that the angle $\alpha$ between the line joining
their mid-points and the generators is too far from 90
degrees. Instead of using these pairs of ribbon pairs, we
plan to find right Brooks ribbons by replacing edgels
by data structures pointing to adjacent pairs of edgels
(some kind of “mid-edgels”), and using them to generate
ribbon pairs, with a cost function based on the angle
$\alpha$. Unfortunately, this method does not work for arbitrary
Brooks ribbons, for which the angle between generators
and spine is unknown. In [Ponce, 1988], local
characterizations of some other classes of Brooks ribbons,
such as worms and skewed symmetries [Kanade,
1981], are described. We plan to implement them in
the near future.


Figure  4.  The  junctions  found  by  the  algorithm  described 
in  section  3.2. 

3.  Graphs  of  junctions  and  edges 

In  this  section,  we  describe  an  algorithm  for  building 
an  edge-graph  from  imperfect  contour  data;  nodes  in 
the  graph  are  edge  junctions,  arcs  are  edges  between 
junctions.  An  application  of  this  algorithm  is  in  find¬ 
ing  geometrically  meaningful  regions  as  cycles  in  this 
graph.  The  basic  idea  here  is  to  use  boxes  enclos¬ 
ing  contour  segments  to  determine  the  curvilinearity 
of  contour  fragments  and  find  junctions. 

In section 3.1, we briefly describe an algorithm for
segmenting contour fragments into line segments, and
enclosing the contours into rectangular boxes. These
boxes and a quadtree representation of images are used
in section 3.2 to find contour junctions and build the
graph representation. This graph is used in section 3.3
to find minimal cycles. Section 3.4 shows and discusses
some results.

3-1.  Linear  approximation 

Robust algorithms for segmenting contours into line
segments are useful for many tasks, including recognition
of two- or three-dimensional objects in images
(e.g., [Ayache and Faugeras, 1986], [Grimson and
Lozano-Perez, 1987], [Lowe, 1987]). In this section, we
briefly discuss a new algorithm which segments a contour
into a set of line segments by using a bottom-up
(merge) approach. We use the algorithm in the following
two sub-sections to find edge junctions.

Locally, the direction of each edgel is given by the
edge detector (e.g., in Canny’s case [Canny, 1986], by
the gradient direction). Adjacent edgels with directions
close to each other can be merged to form seeds


Figure  5.  Some  minimal  cycles  found  by  the  algorithm  of 
section  3.3. 

for straight lines. Least squares linear approximations
are computed for these seeds. Adjacent seeds can in
turn be merged into longer line segments if the resulting
error is small enough. By repeating this process,
we obtain a set of line segments approximating the contour.

The process of merging line pairs is analogous to
the construction of ribbons from ribbon pairs, as described
in section 2.3. We use a similar algorithm and
data structure. We consider one edge fragment (i.e., a
simply connected sequence of edgels) at a time. Line
pairs are sorted according to the error committed when
merging them into a single line. When a line pair is
popped, the two corresponding lines $l_i$ (i = 1, 2) are
merged into a new line. All line pairs containing the
lines $l_i$ are repositioned within the list of line pairs after
replacing the corresponding $l_i$ in each pair by the new
line $l$ and computing the new cost of the line pair.
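
The bottom-up merge can be illustrated with the sketch below. It is a simplification under stated assumptions: segments are plain lists of (x, y) points along one edge fragment, the error is the RMS perpendicular distance to a total-least-squares line, and the best adjacent pair is found by rescanning all pairs rather than with the sorted bucket list of section 2.3. The function names and the tolerance value are hypothetical.

```python
import math


def fit_error(points):
    """RMS perpendicular distance of points to their total-least-squares line."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Smaller eigenvalue of the 2x2 scatter matrix = mean squared residual.
    lam = 0.5 * (sxx + syy - math.hypot(sxx - syy, 2.0 * sxy))
    return math.sqrt(max(lam, 0.0))


def merge_segments(seeds, tolerance):
    """Greedily merge adjacent point runs while the merged fit error stays small."""
    segments = [list(s) for s in seeds]
    while len(segments) > 1:
        errors = [fit_error(segments[i] + segments[i + 1])
                  for i in range(len(segments) - 1)]
        best = min(range(len(errors)), key=errors.__getitem__)
        if errors[best] > tolerance:
            break
        segments[best:best + 2] = [segments[best] + segments[best + 1]]
    return segments


# An L-shaped contour, pre-grouped into four seeds.
seeds = [[(0, 0), (1, 0)], [(2, 0), (3, 0), (4, 0)],
         [(4, 1), (4, 2)], [(4, 3), (4, 4)]]
result = merge_segments(seeds, tolerance=0.3)
```

On this input the two horizontal seeds and the two vertical seeds are merged, and the corner stops any further merging.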

3-2.  Building  the  graph 

The previous algorithm provides, in addition to the
line segments, rectangular boxes centered around them
and enclosing the corresponding contour fragments:
the width of these boxes is computed by taking into
account the least squares error and the uncertainty in
the edge localization (a function of edge operator size
and contrast); the length of the boxes is given by the
length of the line segments plus twice the localization
uncertainty. The representation obtained is analogous
to Ballard’s strip trees [Ballard, 1981], although Ballard
uses a top-down (split) approach to segmentation.

Boxes can in turn be used to find edge junctions.
A junction between edges is defined as the center of
gravity of the intersection of boxes. To localize intersections
between boxes, we use a quadtree whose
leaves are empty or point to the list of boxes intersecting
them. Thus, we need only compare a given box
to the other boxes associated with the leaves it intersects.
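
The idea of comparing a box only with boxes that share a spatial cell can be sketched as follows. Two simplifications are swapped in and are not the method of the text: a uniform grid stands in for the quadtree, and boxes are axis-aligned `(xmin, ymin, xmax, ymax)` tuples rather than rectangles oriented along their line segments. The function name and cell size are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_junctions(boxes, cell_size):
    """Return {(i, j): (cx, cy)} for every pair of intersecting boxes."""
    cells = defaultdict(list)
    for index, (x0, y0, x1, y1) in enumerate(boxes):
        for cx in range(int(x0 // cell_size), int(x1 // cell_size) + 1):
            for cy in range(int(y0 // cell_size), int(y1 // cell_size) + 1):
                cells[(cx, cy)].append(index)
    junctions = {}
    for members in cells.values():
        # Only boxes registered in the same cell are compared.
        for i, j in combinations(members, 2):
            if (i, j) in junctions:
                continue
            a, b = boxes[i], boxes[j]
            x0, y0 = max(a[0], b[0]), max(a[1], b[1])
            x1, y1 = min(a[2], b[2]), min(a[3], b[3])
            if x0 <= x1 and y0 <= y1:
                # Center of gravity of the rectangular intersection.
                junctions[(i, j)] = ((x0 + x1) / 2, (y0 + y1) / 2)
    return junctions


boxes = [(0, 4, 10, 6),      # box around a horizontal segment
         (8, 0, 10, 12),     # box around a vertical segment
         (20, 20, 25, 22)]   # distant box, never compared with the others
junctions = find_junctions(boxes, cell_size=8)
```

For oriented boxes the intersection is a convex polygon, and its centroid would replace the rectangle center used here.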

Edges  are  split  into  shorter  edges  at  junctions,  which 
form  nodes  in  a  graph.  An  arc  in  that  graph  is  labelled 
by  the  corresponding  edge,  and  relates  the  junctions  at 
the ends of that edge. Notice that more than two edges
may be incident at a given junction (T-junctions, for example,
are obtained when the box associated with
one edge intersects the middle part of another edge),
but an edge has at most two junctions.

3-3.  Finding  cycles 

The  graph  found  in  the  previous  section  may  have 
several  connected  components  (e.g.,  the  graph  in  Fig¬ 
ure  4  has  eight  connected  components).  In  this  section, 
we  give  an  algorithm  to  find  all  the  minimal  cycles  of 
a  given  component. 

A  minimal  cycle  is  a  cycle  which  does  not  enclose 
any  other  cycle  in  the  associated  connected  component. 
Minimal  cycles  are  found  by  using  the  left  shoulder 
algorithm [Minsky, 1963]: start at a given junction,
and  traverse  the  graph  by  always  turning  left  as  much 
as  possible  at  each  junction;  if  an  edge  termination  is 
found,  backtrack  to  the  previous  junction  and  continue; 
stop  when  back  at  the  original  junction.  The  sequence 
of  edges  followed  is  a  minimal  cycle. 

By  repeating  this  process  for  all  junctions,  all  mini¬ 
mal  cycles  are  guaranteed  to  be  found.  Moreover,  any 
edge  (arc)  in  the  graph  can  be  traversed  in  two  di¬ 
rections,  and  may  belong  to  at  most  two  minimal  cy¬ 
cles.  By  marking  the  visited  edges  and  the  directions 
in  which  they  are  visited,  it  is  possible  to  avoid  looking 
for  cycles  along  edges  which  have  already  been  visited. 

For each cycle found, we also calculate during the
construction of the cycle its angle of turn, obtained by
integrating the turns made from edgel to edgel along
the cycle. This allows us to categorize the area enclosed
by the cycle: if the angle of turn, measured counterclockwise,
is $-2\pi$, then the enclosed area is the interior
of the cycle; if the angle of turn is $2\pi$, then it is the
exterior that is enclosed by the cycle. Each connected
component can have at most one cycle enclosing the
exterior, and an unlimited number of cycles enclosing
the interior.
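
The traversal and the angle-of-turn test can be sketched together. The sketch makes several assumptions that are not in the text: junctions carry (x, y) coordinates with y pointing up, edges are treated as straight between junctions, dead ends are pruned beforehand instead of being handled by backtracking, and the graph has no bridges. With y pointing up, always turning left gives a total turn of +2*pi for interior cycles; the sign flips if y increases downward, as in image coordinates.

```python
import math


def minimal_cycles(coords, edges):
    """coords: {node: (x, y)}; edges: iterable of (u, v). Returns interior cycles."""
    nbrs = {n: set() for n in coords}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    # Prune dead ends so that no backtracking is needed while tracing.
    stack = [n for n in nbrs if len(nbrs[n]) == 1]
    while stack:
        n = stack.pop()
        if len(nbrs[n]) != 1:
            continue
        (m,) = nbrs[n]
        nbrs[n].clear()
        nbrs[m].discard(n)
        if len(nbrs[m]) == 1:
            stack.append(m)

    def turn(u, v, w):
        """Signed turn at v (left positive) when going u -> v -> w."""
        a = math.atan2(coords[v][1] - coords[u][1], coords[v][0] - coords[u][0])
        b = math.atan2(coords[w][1] - coords[v][1], coords[w][0] - coords[v][0])
        return (b - a + math.pi) % (2 * math.pi) - math.pi

    visited, cycles = set(), []
    for u in nbrs:
        for v in nbrs[u]:
            cycle, total, a, b = [], 0.0, u, v
            while (a, b) not in visited:
                visited.add((a, b))          # mark the edge and its direction
                cycle.append(a)
                c = max((w for w in nbrs[b] if w != a),
                        key=lambda w: turn(a, b, w))   # leftmost turn
                total += turn(a, b, c)
                a, b = b, c
            if total > 0:                    # about +2*pi: an interior cycle
                cycles.append(cycle)
    return cycles


coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (2, 1),
          "E": (1, 1), "F": (0, 1), "G": (3, 0)}
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"),
         ("E", "F"), ("F", "A"), ("B", "E"), ("C", "G")]
cycles = minimal_cycles(coords, edges)
```

Two adjacent squares with a dangling edge yield the two square cycles; the cycle around the outside accumulates a negative total turn and is discarded.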

3-4.  Results 

The graph found for the Smith valve image is shown
in Figure 4. Notice the junctions which had been
missed in Figure 2. Some of the cycles found by our
algorithm are shown in Figure 5. In this case, there are
190 cycles in the graph structure. After filtering with a
compactness measure, 23 cycles are left; they correspond
sensibly to facets of the valve. The algorithm fails to segment the
handle of the Smith valve as one region due to the large
number of cycles within this area. Note that the handle
had been segmented as one region by the ribbon finder
(Figure 3). The cycle finding algorithm has however
found the most important facets used by Binford et al.
[Binford 1987] for prediction and matching. Building
the graph and finding the minimal cycles takes about
15 minutes on a Symbolics LISP machine. This method
is therefore faster than the ribbon finding approach.


4.  Physical  segmentation 

In  addition  to  segmentation  based  on  image  geome¬ 
try  constraints,  it  is  also  possible  to  use  constraints  as¬ 
sociated  with  the  physics  of  image  formation  to  guide 
segmentation.  Physical  constraints  can  often  be  used 
to  resolve  ambiguities  that  exist  when  considering  only 
geometrical  information. 

In the next two sections we discuss segmentation algorithms
based on the physics of image formation. In
section 5, we present an algorithm for linking edges
based on color. In section 6, we use the result generated
by our geometric segmentation algorithm to construct
a scene description in terms of surface patches,
their properties, and their relationships.


5.  Edge  Linking  Using  Color 

There  has  been  progress  on  edge  linking  algorithms 
using  edgel  proximity  and  direction  [Nalwa,  1987].  Un¬ 
fortunately,  this  algorithm  is  local  in  character  and 
sometimes  fails  to  find  a  meaningful  object  boundary. 
It has been shown that color normalized by intensity
is relatively stable under changes in geometry [Healey,
1987]. Thus, normalized color information can be used
to determine whether or not an edge is a material
boundary. In this section, we describe an algorithm for
linking edges based on normalized color. Some results
are presented in section 5-2.

5-1.  Bivariate  normal  color  model 

We assume that intensity is the arithmetic average of
the red (R), green (G) and blue (B) components and that
the three primaries divided by intensity are relatively
independent of surface orientation or illumination. The
Karhunen-Loeve transformation was used to find principal
components in normalized r, g and b. We define
two components of normalized color, $C_1$ and $C_2$, by

Figure 6. Edge image


Figure 7. Linked edges for valve handle

Intuitively, $C_1$ is an odd spectral component and $C_2$
is an even component. The normalized color components
of a pixel (r g b) lie on the r + g + b = 3 plane,
and it is easy to verify that the $C_1$ and $C_2$ vectors lie
on the plane and are perpendicular to each other.
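
The normalization itself is easy to illustrate. The sketch below only covers the division by intensity; the function name is hypothetical, and returning `None` for black pixels is an assumption about how undefined values might be handled.

```python
def normalized_color(red, green, blue):
    """Divide each primary by the intensity (R + G + B) / 3."""
    intensity = (red + green + blue) / 3.0
    if intensity == 0:
        return None  # black pixel: normalized color is undefined
    return red / intensity, green / intensity, blue / intensity


bright = normalized_color(200, 100, 0)
dim = normalized_color(100, 50, 0)    # same surface color at half the intensity
```

Both calls give the same triple, whose components sum to 3, which is the plane r + g + b = 3 mentioned above.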

Given an image edge, we can take sets of pixel samples
from each side of the edge. We approximate the
distributions of $C_1$ and $C_2$ over the samples on one side
of an edge by the bivariate normal distribution

$$F(C_1, C_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left(-\frac{U(C_1, C_2)}{2(1-\rho^2)}\right)$$

where

$$U(C_1, C_2) = \frac{(C_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(C_1-\mu_1)(C_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(C_2-\mu_2)^2}{\sigma_2^2},$$

$\mu_1$ and $\mu_2$ are the means, and $\sigma_1$, $\sigma_2$ are the standard
deviations of $C_1$ and $C_2$ respectively, and $\rho$ is the correlation
coefficient. The two normalized color components
and the bivariate normal approximation have been used
successfully to segment outdoor scenes [Kong 1985].

Figure 8. See text

5-2. Experimental Results

An  implementation  of  Canny’s  edge  detector  [Canny, 
1986]  was  used  on  a  color  image  of  a  valve  part.  The 
edges  were  subsequently  divided  into  linear  segments. 
Pixels  along  the  perpendicular  direction  to  an  edgel 
were  sampled  on  each  side  of  the  edgel.  Pixels  within 
two  pixels  of  an  edgel  were  not  sampled  to  avoid  the 
color  blurring  which  can  occur  across  edgels.  This 
guard  size  of  two  pixels  was  found  to  be  reasonable 
for  the  valve  image.  The  sample  and  guard  size  can  be 
adaptively  changed  to  avoid  sampling  across  a  neigh¬ 
boring  edge. 

Parameters  for  the  bivariate  normal  distribution 
were  calculated  from  the  samples.  The  Mahalanobis 
norm  was  used  to  define  the  distance  between  sample 
means.  The  covariance  matrix  was  calculated  over  the 
entire  image  except  at  black  pixels. 
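
Estimating the distribution parameters from samples might look like the sketch below. It assumes the standard bivariate normal density, maximum-likelihood estimates (division by n rather than n - 1), and samples given as (C1, C2) tuples; the function names are hypothetical.

```python
import math


def fit_bivariate_normal(samples):
    """Estimate (mu1, mu2, sigma1, sigma2, rho) from (c1, c2) samples."""
    n = len(samples)
    mu1 = sum(c1 for c1, _ in samples) / n
    mu2 = sum(c2 for _, c2 in samples) / n
    v1 = sum((c1 - mu1) ** 2 for c1, _ in samples) / n
    v2 = sum((c2 - mu2) ** 2 for _, c2 in samples) / n
    cov = sum((c1 - mu1) * (c2 - mu2) for c1, c2 in samples) / n
    s1, s2 = math.sqrt(v1), math.sqrt(v2)
    return mu1, mu2, s1, s2, cov / (s1 * s2)


def density(c1, c2, mu1, mu2, s1, s2, rho):
    """Standard bivariate normal density at (c1, c2)."""
    z1, z2 = (c1 - mu1) / s1, (c2 - mu2) / s2
    u = z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2   # quadratic form
    k = 1.0 - rho * rho
    return math.exp(-u / (2.0 * k)) / (2.0 * math.pi * s1 * s2 * math.sqrt(k))


samples = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
params = fit_bivariate_normal(samples)
peak = density(1.0, 1.0, *params)
```

For these four samples the components are uncorrelated with unit standard deviation, so the density at the mean is 1 / (2*pi).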

Let $\mu_a$ and $\mu_b$ be the mean vectors of colors of neighboring
edges:

$$\mu_a = (\mu_{1a} \; \mu_{2a}), \qquad \mu_b = (\mu_{1b} \; \mu_{2b}).$$

Then the Mahalanobis distance $d$ between the two
mean vectors is

$$d = (\mu_a - \mu_b)\, \Sigma^{-1} (\mu_a - \mu_b)^T$$

where $\Sigma$ is the covariance matrix over the image.
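
The distance and its use for choosing a link can be sketched as follows. The code evaluates the quadratic form exactly as written above, without a square root; the function name, the candidate labels, and the numeric values are hypothetical.

```python
def mahalanobis(mean_a, mean_b, cov):
    """(mu_a - mu_b) Sigma^{-1} (mu_a - mu_b)^T for 2-vectors and a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    # Inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


cov = [[4.0, 0.0], [0.0, 1.0]]          # covariance over the whole image
edge_mean = (1.0, 2.0)                  # mean color beside the current edge
candidates = {"edge 17": (4.0, 6.0), "edge 23": (1.5, 2.5)}
distances = {name: mahalanobis(edge_mean, mean, cov)
             for name, mean in candidates.items()}
best = min(distances, key=distances.get)
```

The candidate whose mean color is closest under this measure would be selected for linking.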


Edgels near the end point of an edge are chosen, and
the Mahalanobis distance is used to select the best edge
for linking.

The edges for the valve image are shown in fig. 6, and
a set of linked edges defining the yellow plastic handle
is shown in fig. 7. Ellipses with $U(C_1, C_2) = 1$ for the
bivariate normal approximation of the linked edge color
are shown in fig. 8.

Performance  of  the  linking  algorithm  is  strongly  in¬ 
fluenced  by  the  output  of  the  edge  detector.  The  ability 
of  an  edge  detector  to  locate  T  junctions  is  crucial  in 
linking. 

With  the  correction  of  camera  response  nonlinearity, 
we  expect  that  the  stability  of  the  normalized  color  and 
the  linking  algorithm  will  be  enhanced. 


6.  Generating  Physical  Descriptions 


Given  the  geometric  segmentation  of  an  image  into 
edges  and  regions,  it  is  desirable  to  compute  physical 
descriptions  of  the  surface  patches  producing  image  re¬ 
gions  and  to  generate  hypotheses  about  the  physical 
causes  of  the  image  edges  separating  regions.  Inferring 
this  kind  of  information  allows  us  to  go  from  our  geo¬ 
metrical  description  of  the  image  to  a  physical  descrip¬ 
tion  of  the  3D  scene.  In  this  section,  we  discuss  an  ap¬ 
proach  to  the  problem  of  using  our  geometric  segmen¬ 
tation  to  generate  a  description  of  the  scene  in  terms 
of  surface  patches,  their  properties,  and  their  relation¬ 
ships. 

It is useful for segmentation to separate the intensity
and spectral attributes of the light reflected from
a surface. This is because intensity provides strong
information about geometry, while color provides strong
information about material composition [Shafer, 1984],
[Healey, 1987]. Since one of the goals of physical
segmentation is to distinguish geometrical variation from
material  variation  in  the  scene,  computing  distinct  in¬ 
tensity  and  color  descriptions  for  each  pixel  in  the  im¬ 
age  is  an  important  step  towards  physical  segmenta¬ 
tion.  In  this  work,  we  use  the  method  described  in  [7] 
to  compute  distinct  intensity  and  color  descriptions  for 
each  pixel  in  an  image  region. 

6-1.  Relating  Surface  Patches 

In  this  subsection,  we  examine  the  problem  of  in¬ 
ferring  the  relationships  between  surface  patches.  An 
important  part  of  solving  this  problem  is  determining 
the physical causes of the image edges separating
adjacent image regions. We currently consider the
simplified case of a scene illuminated by a single spectral
power distribution, with no image edges caused by
illumination discontinuities.


Figure 9. An edge image with labeled regions

As a first step towards general physical segmentation,
we have developed a procedure which locates regions
corresponding  to  surface  patches  of  approximately  the 
same  spectral  reflectance.  The  procedure  consists  of 
computing  the  normalized  physical  color  (NPC)  [7]  of 
each  image  region  and  finding  those  regions  which  are 
sufficiently  close  in  NPC  space  according  to  our  color 
metric [7]. Given our assumption of a single spectral
power distribution of illumination, regions which
have  similar  normalized  physical  colors  will  have  sim¬ 
ilar  spectral  reflectance  functions.  If  our  scene  con¬ 
tains  a  small  number  of  objects,  it  is  likely  that  surface 
patches  with  similar  spectral  reflectance  functions  are 
made  of  the  same  material.  Also,  unless  there  is  ev¬ 
idence  to  the  contrary,  adjacent  image  regions  whose 
surface  patches  have  similar  spectral  reflectance  func¬ 
tions  probably  correspond  to  surface  patches  on  a  single 
object  which  are  separated  by  a  geometrical  disconti¬ 
nuity. 

6-2.  Results 

We  have  tested  our  procedure  on  an  image  of  the 
valve  part.  Figure  9  shows  an  edge  image  obtained 
using  Canny's  edge  detector  [Canny,  1986].  In  Fig¬ 
ure  9  we  consider  nine  regions  which  are  similar 
to  those  found  by  the  geometrical  segmentation  algo¬ 
rithm.  Figure  10  shows  the  coordinates  of  these  re¬ 
gions  in  NPC  space.  The  cluster  formed  by  regions 
1, 2, 5, 6, 7 in Figure 10 indicates that the corresponding
surface patches have approximately the same spectral
reflectance. These regions, in fact, correspond to
surface  patches  of  the  same  material  (the  brass  ma¬ 
terial  on  the  valve).  The  cluster  formed  by  regions 
3,4,8  indicates  that  their  surface  patches  have  nearly 
the  same  spectral  reflectance.  These  regions,  in  fact, 
correspond  to  silver  colored  patches  on  the  valve.  Re¬ 
gion  10,  the  yellow  handle,  maps  to  a  separate  part  of 
NPC  space.  Thus,  we  see  that  this  procedure  is  very 
useful  for  grouping  together  surface  patches  which  are 
composed  of  the  same  material.  This  information  can 
readily  be  applied  to  object  recognition. 

Figure 10. Region coordinates in NPC space

It is important to emphasize that there are large
differences in the image irradiance values corresponding to
the regions 1, 2, 5, 6, 7 and also to the regions 3, 4, 8. This
is because of differences in the orientation of the
surface patches. Therefore, a clustering technique based
on image irradiance would not be able to group these
sets of regions. It is only by using normalized color that
we are able to group together regions corresponding to
surface patches made of the same material.

ABSTRACT


We present a method to smooth a signal, be it an
intensity image, a range image or a contour, which
preserves discontinuities and facilitates their detection. This
is achieved by repeatedly convolving the image with a very
small averaging filter modulated by a measure of the
“continuousness” at each pixel. This process is related to the
Anisotropic Diffusion reported recently by Perona and Malik,
but is not subject to instability or divergence. We
first show how this basic approach can be applied when
it is reasonable to model an image as piecewise constant
(such as most aerial or outdoor images), then turn to the
case when it becomes necessary to detect discontinuities of
the first derivatives, as is the case for range images, and
finally present the case when the input signal is a connected
curve. The method is attractive because there is a
single parameter to adjust, analogous to the variance $\sigma$ in
the Scale-Space paradigm, and it produces a robust
segmentation as no tracking is needed. We illustrate the method
with results on real intensity and range images, and on
actual contours.


1  Introduction 


In order to understand images successfully, it is
necessary to transform the viewer-centered input into
object-centered descriptions. Physical boundaries
of objects are very important descriptors and are likely to
generate edges during the imaging process. Even though
the reverse is not true, it is reasonable to assume that the
early stages in image analysis consist of detecting such
discontinuities. Due to the complexity of the physical world
and of the imaging apparatus, and to multiple sources of
noise, the signal to be processed is complex, and the
detection of such discontinuities is nontrivial. Features detected
locally are validated only by considering a more global
context. We begin by reviewing the major approaches
chosen to tackle this challenging problem and classify them in
three categories:


Fixed Mask methods, in which the model consists of
a single edge, and the coefficients of the convolution
mask are evaluated to best detect such an edge.
An example of such a scheme is Gaussian smoothing.
The main drawback of these methods is that,
since they do not explicitly model multiple edges in
the smoothing window, edges are displaced or not
detected if they occur too close together.

Scale-Space methods propose to look at the signal
after smoothing with masks of multiple sizes. Since they
use multiple scales, they need either to solve a
nontrivial correspondence problem or to compute a very
large number of close scales.

Adaptive methods change either the coefficients or the
size of the masks based on the values in a window.
Some of them are computationally expensive or use a
rather ad hoc strategy, but recently Perona and Malik
[24] presented an interesting formulation of the
problem.


Following the survey, we present a simpler variation of this
last approach, in which we iteratively convolve the image
with a mask whose coefficients reflect the degree of
continuity of the underlying image surface. This Adaptive Filter is
applicable when it is reasonable to assume that the image
is piecewise constant. We then study images in which this
assumption does not hold and where we have to detect roof
edges in addition to step edges, such as range images. We
also present a short extension of the method to 1-D signals
such as the bounding contour of an object. In all cases,
we show that the desired features are enhanced and their
detection becomes very easy, and illustrate our claims with
results on real signals. Finally, we discuss the limitations
of the approach and propose some possible improvements
and extensions.


2  Previous  Work 


As  mentioned  in  the  introduction,  it  is  convenient  to  clas¬ 
sify  the  Feature  Extraction  methods  in  three  classes,  even 
though  the  boundaries  between  them  may  not  be  sharp. 

2.1  Fixed  Mask  Methods 

In this group, the authors consider the case of a single edge.
The goal of their methods is to find the optimal filter (in
terms of signal to noise ratio) for the detection of such an
edge. We only report the major ideas in this domain.
Shanmugam et al [29] define an edge as a step discontinuity
between regions of uniform intensity and show that the ideal
filter is given by a prolate spheroidal wave function. Marr
and Hildreth [19], extending the work of Marr and
Poggio [18], convolve the signal with a rotationally symmetric
Laplacian of Gaussian mask and locate zero-crossings
of the resulting signal. In their work, they already
mention that a multiple scale approach is necessary, pointing
out the difficult problem of integration. Haralick [14]
locates edges at the zero-crossings of the second directional
derivative in the direction of the gradient, where
derivatives are computed by interpolating the data. In [13], his
facet model is extended to the Topographic Primal Sketch.
In a recent paper, Shen and Castan [30] propose an
optimal linear filter in which the image is convolved with
the smoothing function $f(x) = -\frac{1}{2} (\ln b) \, b^{|x|}$ prior to
differentiation. They claim better localization than the
Marr-Hildreth zero-crossing detector.

Since these methods ignore the occurrence of multiple
interfering edges, they typically displace the true location
of edges, or worse, fail to detect some of them. When the
problem is displacement only, a nice method can recover
the true location of the edges [7], but still leaves the
non-detection problem open.

2.2  Scale-Space  Methods 

As noted by several authors in the first group, the
automatic adjustment of the size (or scale) parameter is
difficult, so using multiple scales should provide a reasonable
answer. This idea is based on some physiological work
echoed in [17] for a few scales, but the integration of these
discrete scales is an open problem. Instead of using
discrete distant scales, Witkin [33] proposed a continuum of
scales and showed that, at least in 1-D, the interpretation
of the multiscale response made the important information
explicit. In the 2-D case, the discretization of the
formulation requires a large amount of memory,
as in edge focusing [3]; otherwise heuristics need to be
applied to establish correspondence between scales. This
was done with some success by Asada and Brady for 2-D
curves [1] in their Curvature Primal Sketch, then by Ponce
and Brady [25] and Fan et al [10] in the case of surfaces.
In his paper [6], Canny defines a set of criteria for the
integration of multiple size masks, but his implementation
uses fixed size masks only.

2.3  Adaptive  Methods 

The general idea behind adaptive smoothing is to
evaluate the best smoothing mask as a function of the
information available in the signal to be smoothed. Even though
detailed overviews of some Adaptive Smoothing methods
can be found in [8,20], in which some evaluations are also
performed, it is interesting to recall some ideas which form
the basis of many of these methods, including the newest
approaches.

One of the first interesting investigations in this field may
be found in [16], in which Rosenfeld et al propose some
iterative weighted averaging methods. In particular, they
propose to apply at each point a weighted mask whose
coefficients are based on an evaluation of the difference
between the value at the center point and the values of its
neighbors. A similar but simpler approach can be found
in [32], in which the weighting coefficients are the
normalized gradient-inverse between the current point and each
neighbor. Another method consists of selecting the
neighbor points whose values are closest to the value
of the central point and replacing the latter by the
average of these values [9]. More sophisticated methods are
based upon local statistics of each point's
neighborhood [20]. The major drawback of these iterated
smoothing methods is that their convergence toward
a stable state is not clear. In more recent papers [5,25],
Brady et al prevent smoothing across previously detected
discontinuities by using computational molecules proposed
by Terzopoulos [31]. This implies that they have already
detected discontinuities, which is the problem we have to
solve! Geman and Geman [12] appeal to simulated
annealing, which is computationally very expensive but leads
to impressive results, which are shown only on gray scale
images with a few gray levels. A completely different
approach which makes use of the curvature of the underlying
image surface is proposed by Saint-Marc and Richetin [28],
in which they apply a directional mask in the direction of
least curvature in highly curved areas or a standard
averaging square mask otherwise. The results are impressive,
but this method is only applicable to surface smoothing,
not to planar curves for instance. Parvin and Medioni [23]
present a method to extract meaningful features from range
images. Their strategy consists of automatically selecting
an adequate kernel size for the detection and localization of
such features. Although the method requires the
non-obvious setting of five parameters, it directly provides features
at different scales and is applicable to curves. Finally,
Perona and Malik [24] have proposed to cast the problem
in terms of the heat equation [15], but in an anisotropic
medium. They have presented impressive results on
complex images. The method presented here is a refinement
and a generalization of the same idea.


3  Adaptive  Smoothing 

In this section, we introduce our Adaptive Filter. Its
performance is illustrated on a 1-D signal, then on an intensity
image, both of them approximately piecewise constant.
We point out its limitations and offer solutions allowing us
to deal with more complex signals.


3.1  Principle 

By far the most common filter used in smoothing is the
Gaussian filter. As pointed out by many authors, such a
filter has very desirable properties, in particular no new
zero-crossings appear in the Laplacian of the signal as $\sigma$
increases [34]. In addition, Gaussian convolution can be
computed efficiently by a cascade of convolutions with any
finite filter [4,5,6], in particular a small mask with all
coefficients set to 1. In the case of an image, we formulate this
process as follows in order to later introduce our Adaptive
Filter more easily. Let the image be expressed as I(x, y, 0)
before smoothing, and let C(x, y) be a coefficient image of
the same size such that C(x, y) = 1 for all x and y. The filtered
image I(x, y, n + 1) at the (n + 1)th iteration is simply:

$I(x, y, n+1) = \frac{1}{N} \sum_{k=-1}^{+1} \sum_{l=-1}^{+1} I(x+k, y+l, n) \, C(x+k, y+l)$

with

$N = \sum_{k=-1}^{+1} \sum_{l=-1}^{+1} C(x+k, y+l)$
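
The cascade property mentioned above can be checked numerically. The sketch below (Python, with hypothetical helper names) convolves a 3-tap mean mask with itself n times and measures the variance of the resulting kernel. Each pass adds the variance of the mask, 2/3, so after n passes the equivalent kernel has variance 2n/3, and by the central limit theorem its shape approaches a Gaussian of that variance. Only the 1-D case is shown, on the assumption that the 3 x 3 mask of ones is applied separably along each axis.

```python
def convolve(a, b):
    """Full discrete convolution of two 1-D sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def cascaded_mean_kernel(n):
    """Kernel equivalent to n successive passes of the 3-tap mean filter."""
    box = [1.0 / 3.0] * 3
    kernel = [1.0]
    for _ in range(n):
        kernel = convolve(kernel, box)
    return kernel


def variance(kernel):
    """Variance of a normalized, symmetric kernel about its center."""
    center = (len(kernel) - 1) / 2.0
    return sum(w * (i - center) ** 2 for i, w in enumerate(kernel))
```

For instance, `variance(cascaded_mean_kernel(10))` equals 20/3 up to rounding, which corresponds to a Gaussian with $\sigma$ of roughly 2.6 samples.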

But it is well known that this filter smooths the data
everywhere, even across the depth discontinuities. If we assume
that we already know the location of these discontinuities,
and we set the corresponding points of the coefficient
image C(x, y) to 0, then smoothing a point near a
discontinuity will not take into account those points belonging to the
discontinuity. As we use a small mask, there is no risk
of merging points belonging to different regions. For those
points belonging to discontinuities, the repeated
averaging process will force them to belong to one of the nearby
regions, therefore enhancing the discontinuities.
Unfortunately, we do not know the location of the discontinuities,
otherwise our problem would be solved and would not need
any smoothing. Instead, we can formulate a guess by
computing at each point a continuity value using, as in [24],
any monotonically decreasing function C such that C(0) = 1
and $C(d) \to 0$ as d increases, where the variable d
represents the “chance” of having a discontinuity at that point.
An estimate of d can be computed simply by relating its
value to the value of the gradient at that point (if we
assume that the original signal is approximately piecewise
constant), or it can be more elaborate, as suggested in [24].

Let us now formulate our Adaptive Smoothing scheme. Let
the image be expressed as I(x, y, 0) before smoothing, and
I(x, y, n) at the nth iteration. From I(x, y, n), we can
determine the coefficient image C(x, y, n) by computing at
each point the continuity value using the previously
defined function C. The smoothed version of I(x, y, n) is
then defined at each point (x, y) by:

$I(x, y, n+1) = \frac{1}{R} \sum_{k=-1}^{+1} \sum_{l=-1}^{+1} I(x+k, y+l, n) \, C(x+k, y+l, n)$

with

$R = \sum_{k=-1}^{+1} \sum_{l=-1}^{+1} C(x+k, y+l, n)$

Note that this formulation is considerably simpler than the
one given in [24]. Our experiments with their original
formula have led to numerical instabilities and divergence,
even for simple signals. The new equation, as we will see,
consistently gives stable results. So far, we have been
unable to formally establish the equivalence between both
formulations.


3.2  Experiments 

3.2.1  1-D  Signal  Adaptive  Smoothing 

We illustrate our algorithm on a 1-D signal, shown in
figure 1(b), which consists of a slice taken horizontally from
the approximately piecewise constant gray level image
shown in figure 1(a). As one can notice, this noisy signal
contains discontinuities of different strengths. We simply
take for the estimate of d the absolute value of the
derivative. We use for C the same function as in [24], that is:

$C(d) = e^{-k d}$

The result of the application of our Adaptive Filter to this
signal after 250 iterations can be seen in figure 1(e) with
the value of k set to 0.1. Notice how smooth it is and
how well the discontinuities are preserved. As expected,
the resulting signal is a piecewise constant function.
Figure 1(f) illustrates the behavior of the zero-crossings of the
second derivative of the signal as the iteration progresses.
This Scale-Space like representation demonstrates
that features detected at a coarser level (that is, after a
certain amount of smoothing) do not need to be tracked
along the “scale” dimension in order to find their precise
location, as they do in the standard Scale-Space paradigm.
For comparison, figure 1(c) shows the results obtained with
k = 0, which is identical to iteratively applying a mean
filter, which is also equivalent to convolving the signal with
a Gaussian Filter. Not surprisingly, discontinuities
completely disappear after a number of iterations. The
diagram in figure 1(d) shows how zero-crossings migrate and
merge as the iteration progresses with k = 0, which is
equivalent to increasing the $\sigma$ parameter of a Gaussian
convolution.


The method depends on a single parameter k, and its value
is critical, as it directly determines which discontinuities
will remain during the iterative process. If k is taken to
be small, then every edge will diffuse, and the result will
be the same as with Gaussian smoothing. If k is taken to
be large, then every edge will stop the diffusion, and no
smoothing will happen. Choosing the value of k is
equivalent to choosing the value of $\sigma$ in the standard Scale-Space
paradigm from which tracking will be performed. But, in
our case, no tracking along the “scale” dimension is needed.
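
The role of k can be illustrated with a minimal Python sketch of the 1-D scheme. Several choices here are assumptions rather than details taken from the text: the continuity function is taken to be a decaying exponential in k and d, the derivative is estimated by a central difference, border samples are replicated, and the function names are hypothetical.

```python
import math


def continuity(signal, k):
    """Continuity value exp(-k * d) per sample, with d = |central difference|."""
    n = len(signal)
    coeffs = []
    for x in range(n):
        left = signal[max(x - 1, 0)]
        right = signal[min(x + 1, n - 1)]
        d = abs(right - left) / 2.0
        coeffs.append(math.exp(-k * d))
    return coeffs


def adaptive_smooth_1d(signal, k, iterations):
    """Iterate the weighted 3-tap average; k = 0 gives the plain mean filter."""
    current = [float(v) for v in signal]
    n = len(current)
    for _ in range(iterations):
        c = continuity(current, k)
        nxt = []
        for x in range(n):
            num = 0.0
            den = 0.0
            for j in (-1, 0, 1):
                xi = min(max(x + j, 0), n - 1)  # replicate the border samples
                num += current[xi] * c[xi]
                den += c[xi]
            # Keep the old value if every weight underflowed to zero.
            nxt.append(num / den if den > 0.0 else current[x])
        current = nxt
    return current
```

On an ideal step such as `[0] * 5 + [100] * 5`, calling `adaptive_smooth_1d(step, 0.1, 20)` leaves the two plateaus essentially untouched, while `adaptive_smooth_1d(step, 0.0, 20)` blurs the step as repeated mean filtering would.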


(a) Intensity Image
(b) 1-D Signal extracted from (a)
(c) Gaussian Smoothing of (b)
(e) Adaptive Smoothing of (b)
(f) Zero-Crossings of (b) at each iteration (Adaptive)

Figure 1: Adaptive Smoothing of a 1-D Signal.

3.2.2 Intensity Image Adaptive Smoothing

When the signal is a 2-D image, we define d as the norm
of the gradient $(\partial I / \partial x, \partial I / \partial y)^T = (G_x, G_y)^T$, computed in a 3 x 3
window. Therefore, the continuity value C is given by the
following equations:

$C(d) = e^{-k d}$

with

$d = \sqrt{G_x^2 + G_y^2}$

We present the results of applying the Adaptive Filter to the image
in Figure 2(a), which is a typical scene in the environment
of DARPA’s Autonomous Land Vehicle. Figure 2(b) shows
the result obtained after 50 iterations with the parameter
k set to 0.01. Not surprisingly, the results are similar to
the ones obtained for the 1-D signal. The edges are very
well preserved while the image has been smoothed within
regions. These results are comparable to those shown by
Perona and Malik in their paper [24], but we have not been
able to replicate their results using their equations.
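
A compact Python sketch of the 2-D iteration is given below for illustration. It is not the authors' implementation: the gradient is estimated with central differences (one possible estimate within a 3 x 3 window), the continuity function is assumed to be a decaying exponential in k and d, border pixels are replicated, and all names are hypothetical. Images are plain lists of rows.

```python
import math


def clamp(v, lo, hi):
    return max(lo, min(v, hi))


def gradient_norm(img, x, y):
    """d = sqrt(Gx^2 + Gy^2), using central differences (an assumption)."""
    h, w = len(img), len(img[0])
    gx = (img[y][clamp(x + 1, 0, w - 1)] - img[y][clamp(x - 1, 0, w - 1)]) / 2.0
    gy = (img[clamp(y + 1, 0, h - 1)][x] - img[clamp(y - 1, 0, h - 1)][x]) / 2.0
    return math.hypot(gx, gy)


def adaptive_smooth_2d(img, k, iterations):
    """Weighted 3 x 3 averaging, with weights recomputed at every iteration."""
    h, w = len(img), len(img[0])
    cur = [[float(v) for v in row] for row in img]
    for _ in range(iterations):
        c = [[math.exp(-k * gradient_norm(cur, x, y)) for x in range(w)]
             for y in range(h)]
        nxt = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                num = 0.0
                den = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy = clamp(y + dy, 0, h - 1)
                        xx = clamp(x + dx, 0, w - 1)
                        num += cur[yy][xx] * c[yy][xx]
                        den += c[yy][xx]
                # Keep the old value if every weight underflowed to zero.
                nxt[y][x] = num / den if den > 0.0 else cur[y][x]
        cur = nxt
    return cur
```

Because the weights are recomputed from the current image at each pass, pixels adjacent to a strong edge contribute almost nothing to their neighbors, which is what keeps the edge from spreading.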

3.3 Limitations of the method and Solutions

What happens if the original signal cannot be considered as
approximately piecewise constant? This question is very
important if we wish to apply our Adaptive Filter to the
smoothing of range images, for instance. Furthermore, step
discontinuities are not the only discontinuities that need to
be preserved. In range images, one needs to preserve other
types of discontinuities, such as creases. Figure 3(a) shows
the 126 x 131 x 8 bits range image of a toy chair obtained
in our laboratory, to which Gaussian noise ($\sigma = 10$) has
been added. Here, depth is directly encoded by gray-level.
A better representation is a 3-D plot of the scene, as shown
in figure 3(b). From this image, we extract a vertical slice
displayed in figure 3(c). We apply to this 1-D signal the
same algorithm as in 3.2.1, leading, after 250 iterations, to
the smoothed signal shown in figure 3(d). The step
discontinuities are still preserved, but as the signal is
not piecewise constant, areas with significant values of the
derivative have been transformed into staircase type
signals, which is quite disappointing. Furthermore, tangent
orientation discontinuities are not preserved.


(a) Original Intensity Image
(b) Adaptive Smoothing of (a)

Figure 2: Adaptive Smoothing of an Intensity Image.

There is an elegant solution that solves both problems at
the same time: instead of applying the Adaptive Filter to
the signal itself, we apply it to its derivative. Figure 3(e)
shows the result obtained with this method. Since we
filter the derivative only, the smoothed signal is not directly
accessible. For the purpose of display only, we have
performed a numerical integration of the smoothed derivative.
The noise has been removed while discontinuities,
including tangent discontinuities, have been preserved. Finally,
figure 3(f) shows in a Scale-Space like representation the
location of the extrema of the second derivative as the
iteration progresses. These features, which correspond to
discontinuities of the first derivative, are very well preserved
along the “scale” dimension.
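
The derivative-domain variant can be sketched as follows in Python. This is an illustrative reading of the procedure, not the authors' code: the derivative is taken as a forward difference, the same weighted averaging is applied to it (again assuming an exponential continuity function and replicated borders), and the result is integrated by a running sum for display. All names are hypothetical.

```python
import math


def adaptive_smooth_1d(signal, k, iterations):
    """Weighted 3-tap averaging with weights exp(-k * |central difference|)."""
    cur = [float(v) for v in signal]
    n = len(cur)
    for _ in range(iterations):
        c = [math.exp(-k * abs(cur[min(x + 1, n - 1)] - cur[max(x - 1, 0)]) / 2.0)
             for x in range(n)]
        nxt = []
        for x in range(n):
            idx = [min(max(x + j, 0), n - 1) for j in (-1, 0, 1)]
            den = sum(c[i] for i in idx)
            num = sum(cur[i] * c[i] for i in idx)
            nxt.append(num / den if den > 0.0 else cur[x])
        cur = nxt
    return cur


def smooth_via_derivative(signal, k, iterations):
    """Smooth the forward-difference derivative, then integrate it back."""
    deriv = [signal[i + 1] - signal[i] for i in range(len(signal) - 1)]
    smoothed = adaptive_smooth_1d(deriv, k, iterations)
    out = [float(signal[0])]
    for dv in smoothed:
        out.append(out[-1] + dv)  # running sum as numerical integration
    return out
```

A linear ramp has a constant derivative and is therefore left unchanged, rather than being turned into a staircase. On a roof-shaped signal the jump in the derivative plays the role that a step played before, so a sufficiently large k keeps the crease sharp.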


(a) Range Image
(c) 1-D Signal extracted from (a)
(d) Smoothing of (c) with the previous method
(e) Integration of the Smoothed Derivative of (b)
(f) Extrema of the Second Derivative at each iteration (Scale-Space Representation)

Figure 3: Preserving First Derivative Discontinuities.


4 Application to Range Image Segmentation

In this section, we show how the Adaptive Filter introduced
previously can be extended to the smoothing of range
data. As this smoothing scheme preserves important
surface changes, they are then very well extracted by standard
means. Furthermore, as the resulting surfaces are smooth,
low frequency events are also detectable.


4.1  Range  Image  Adaptive  Smoothing 

The smoothing is performed as follows. Given the original
range image R(x, y, 0), we first compute the original
derivative images $P(x, y, 0) = \partial R(x, y, 0) / \partial x$ and $Q(x, y, 0) = \partial R(x, y, 0) / \partial y$.
Let P(x, y, n) and Q(x, y, n) be the derivative images at the
nth iteration. From these two images we compute the
coefficient image C(x, y, n) using at each point the following
formula:

$C(d) = e^{-k d^2}$

with

$d = \Delta = \frac{\partial P}{\partial x} + \frac{\partial Q}{\partial y}$ (Laplacian)

The smoothed versions of P(x, y, n) and Q(x, y, n) are then
obtained by using the coefficient image C(x, y, n)
previously defined, as follows:

$P(x, y, n+1) = \frac{1}{S} \sum_{k=-1}^{+1} \sum_{l=-1}^{+1} P(x+k, y+l, n) \, C(x+k, y+l, n)$

$Q(x, y, n+1) = \frac{1}{S} \sum_{k=-1}^{+1} \sum_{l=-1}^{+1} Q(x+k, y+l, n) \, C(x+k, y+l, n)$

with

$S = \sum_{k=-1}^{+1} \sum_{l=-1}^{+1} C(x+k, y+l, n)$

It is important to notice that C(x, y, n) simultaneously
reflects the continuity of both P(x, y, n) and Q(x, y, n), so
that the smoothing of P(x, y, n) and Q(x, y, n) cannot be
performed independently, which is legitimate. The choice
of the Laplacian is not unique, but is the simplest. It would
have been possible, for instance, to consider the quadratic
variation or another measure of the local variation of the
first derivatives.
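
The coupled update of the two derivative images can be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the text: derivatives are estimated with central differences (one-sided at the border), the continuity function is taken as exp(-k d^2) so that the sign of the Laplacian does not matter, border pixels are replicated, and all function names are hypothetical.

```python
import math


def diff_x(img, x, y):
    """Finite difference along x; one-sided at the image border."""
    w = len(img[0])
    x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
    return (img[y][x1] - img[y][x0]) / (x1 - x0) if x1 > x0 else 0.0


def diff_y(img, x, y):
    """Finite difference along y; one-sided at the image border."""
    h = len(img)
    y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
    return (img[y1][x] - img[y0][x]) / (y1 - y0) if y1 > y0 else 0.0


def smooth_range_derivatives(R, k, iterations):
    """Jointly smooth P = dR/dx and Q = dR/dy with one shared weight image."""
    h, w = len(R), len(R[0])
    P = [[diff_x(R, x, y) for x in range(w)] for y in range(h)]
    Q = [[diff_y(R, x, y) for x in range(w)] for y in range(h)]
    for _ in range(iterations):
        # d is the Laplacian of R, computed from the current P and Q.
        C = [[math.exp(-k * (diff_x(P, x, y) + diff_y(Q, x, y)) ** 2)
              for x in range(w)] for y in range(h)]
        new_p = [[0.0] * w for _ in range(h)]
        new_q = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                sp = sq = s = 0.0
                for j in (-1, 0, 1):
                    for i in (-1, 0, 1):
                        yy = min(max(y + j, 0), h - 1)
                        xx = min(max(x + i, 0), w - 1)
                        c = C[yy][xx]
                        sp += P[yy][xx] * c
                        sq += Q[yy][xx] * c
                        s += c
                new_p[y][x] = sp / s if s > 0.0 else P[y][x]
                new_q[y][x] = sq / s if s > 0.0 else Q[y][x]
        P, Q = new_p, new_q
    return P, Q
```

Using a single coefficient image for both P and Q means that a crease detected through either derivative blocks the averaging of both, which matches the coupling described above.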


We have tested the algorithm on several range images with
success. In order to visualize the performance of the
Adaptive Filter applied to a range image, a possible solution is
to integrate the smoothed derivative images P and Q in
order to obtain a smoothed range image and then take a
3-D view. Instead, we found it more practical to directly
use P and Q and compute with these the shaded image
of the smoothed range image. Figure 4 shows several
results obtained with the range image of figure 3(a), whose
shaded image is shown in figure 4(a). This range image
includes step discontinuities corresponding to transitions
between the object and the background, and also roof
discontinuities corresponding to transitions between planes.
Figure 4(d) shows the result obtained after only 10
iterations of our algorithm, in which k = 0.1. As one can see,
the image is almost perfectly smooth, without any blurring
around the discontinuities. As a comparison, we show the
results obtained after simply applying Gaussian
convolution to the range image with $\sigma = 2$ (figure 4(b)), then with
$\sigma = 4$ (figure 4(c)). Even with $\sigma = 4$, the noise is not
completely removed (see the background) and in both cases,
the “edges” are very much blurred. For completeness, we
have also applied our Adaptive Filter to the smoothing of
the 126 x 160 x 8 bits range image of a more complex object
(a tooth), whose shaded image is displayed in Figure 5(a).
Besides the presence of step and roof discontinuities, this
image is of particular interest since its surface is curved.
The result after ten iterations with k = 0.1 is shown in
figure 5(b). The discontinuities are very well preserved
while the curved surface is now completely smoothed, as
the irregularities of the original surface have completely
disappeared.

(a) Shaded Image of the Range Image of a chair
(b) Shaded Image after Gaussian Smoothing ($\sigma = 2$)
(c) Shaded Image after Gaussian Smoothing ($\sigma = 4$)
(d) Shaded Image after Adaptive Smoothing of its Derivatives

Figure 4: Adaptive Smoothing of the Range Image of a chair.

(a) Shaded Image of the Range Image of a tooth
(b) Shaded Image after Adaptive Smoothing of its Derivatives

Figure 5: Adaptive Smoothing of the Range Image of a tooth.

4.2  Extraction  of  Meaningful  Features 

The pictures presented in the previous section show that
the algorithm produces a striking visual improvement. We
now show how it also facilitates the feature extraction task.
Surface curvature has been widely used recently by many
researchers to achieve range image segmentation [2,5,10,25].
Indeed,  local  curvature  is  an  excellent  tool  to  characterize 
a  surface,  especially  because  it  provides  intrinsic  proper¬ 
ties  of  the  surface.  The  extraction  of  meaningful  features 
from  range  images  is  thus  possible  by  locally  observing  the 
behavior  of  the  curvature.  Zero-crossings  and  extrema  of 
the  maximal  curvature  are  particularly  interesting.  Unfor¬ 
tunately,  as  curvature  is  very  sensitive  to  noise,  it  is  often 
difficult  to  both  detect  and  localize  precisely  such  events. 
Ponce and Brady [25] and Fan et al. both propose to solve
this problem by integrating multiple scales, in effect detecting
features at a coarser scale and locating them at a finer
scale. Since there is, in general, no one-to-one correspondence,
they use heuristics to resolve ambiguities.

We  now  demonstrate  how  our  Adaptive  Filter  greatly  facil¬ 
itates  the  extraction  of  meaningful  features  from  range  im¬ 
ages  without  the  drawbacks  of  the  other  methods.  As  this 
smoothing  scheme  preserves  surface  discontinuities,  their 
detection and their localization are immediate. Furthermore,
as this smoothing process is applied to the first partial
derivatives, they are directly available to compute the
second partial derivatives and hence the principal curvatures.
Figure 6 shows the extraction of the local extrema
of  the  maximal  curvature  from  the  range  image  of  the  chair. 


This extraction is performed as follows: we first label all
the points in the image which correspond to extrema of the
maximal curvature (in the direction of the maximal curvature),
then starting from all points at which the maximal
curvature is above a certain threshold τ, we extend the
“contours” using hysteresis [6]. The same threshold was
used for the three images. Figure 6(a) and figure 6(b)
show the results obtained respectively with the (Gaussian)
blurred range images of figure 4(b) (σ = 2) and figure 4(c)
(σ = 4). The first one is not smoothed enough, therefore
“false” extrema are extracted. The second one is smoothed
enough but there is a lot of distortion which causes the
extrema to be delocalized. Figure 6(c) shows the result
obtained by using our Adaptive Filter: the extrema are
very well detected and localized. It is also important to
note that our method is very insensitive to the choice of
the threshold τ. For completeness, figure 7 shows the
extraction of the zero-crossings and extrema of the maximal
curvature  from  the  range  image  of  the  tooth.  This  image  is 
more complex with slow variations of the curvature. The
results show that low frequency events and high frequency
events (discontinuities) are detected at the same time,
without any cost in the localization of the discontinuities.
Again, this method does not require any complex tracking
along different scales, or any previous knowledge about
discontinuities. Our Adaptive Filter is currently used
in our laboratory by our 3-D Object Recognition System
for the segmentation of the range images [11], replacing the
heuristic steps described above.
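
The step from smoothed first partial derivatives to principal curvatures can be sketched as follows. This is only an illustration under stated assumptions: the function name `principal_curvatures`, the regular unit-spaced grid, and the use of finite differences for the second partials are not taken from the text; the curvature formulas are the standard ones for a surface z(x, y) with p = ∂z/∂x and q = ∂z/∂y.

```python
import numpy as np

def principal_curvatures(p, q, spacing=1.0):
    """Principal curvatures of z(x, y) from its first partials.

    p = dz/dx and q = dz/dy are 2-D arrays indexed [row = y, column = x].
    The grid layout and finite differences are assumptions of this sketch.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # np.gradient returns derivatives along axis 0 (y) then axis 1 (x).
    p_y, p_x = np.gradient(p, spacing)
    q_y, q_x = np.gradient(q, spacing)
    r = p_x                      # d2z/dx2
    t = q_y                      # d2z/dy2
    s = 0.5 * (p_y + q_x)        # d2z/dxdy, averaged for symmetry
    g = 1.0 + p * p + q * q
    gauss = (r * t - s * s) / (g * g)
    mean = ((1.0 + q * q) * r - 2.0 * p * q * s + (1.0 + p * p) * t) / (2.0 * g ** 1.5)
    disc = np.sqrt(np.maximum(mean * mean - gauss, 0.0))
    return mean + disc, mean - disc   # maximal and minimal curvature

# Usage on the analytic surface z = x^2 + 0.25 y^2 (p = 2x, q = 0.5y).
yy, xx = np.mgrid[-5:6, -5:6].astype(float)
k_max, k_min = principal_curvatures(2.0 * xx, 0.5 * yy)
```

Extrema and zero-crossings of `k_max` would then be labeled and linked with hysteresis as described above; that linking step is not shown here.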

5  Application  to  Planar  Curve  Cor¬ 
ner  Detection 

In  this  final  section,  we  show  how  our  Adaptive  Filtering 
method  can  be  applied  to  the  smoothing  of  curves  (here 
planar)  in  order  to  extract  meaningful  features  such  as  cor¬ 
ners  in  the  bounding  contour  of  an  object. 


A  planar  curve  such  as  the  bounding  contour  of  an  object 
is completely defined by the parametrization (x(s), y(s)),
where  x(s)  and  y(s)  are  two  continuous  functions  of  the 
variable  s.  Usually,  these  two  functions  are  not  smooth,  be¬ 
cause  of  the  discretization  of  the  bounding  contour.  There¬ 
fore,  smoothing  is  generally  necessary  when  significant  lo¬ 
cal  events  need  to  be  extracted  from  this  contour.  Such 
meaningful  features  can  consist  of  discontinuities  of  the 
tangent  (corners),  discontinuities  of  the  curvature  (smooth 
join) or inflection points along the curve [1,22,21]. As our
Adaptive  Filter  smooths  any  original  signal  while  preserv¬ 
ing  its  discontinuities,  it  seems  appropriate  to  use  it  for 
such  a  purpose. 

If we wish to extract corners, the first idea that comes to
mind is to take as original signal the tangent along the
curve θ(s) where:

$$\theta(s) = \arctan\frac{\dot{y}(s)}{\dot{x}(s)}$$

Thus, the Adaptive Smoothing of this 1-D signal would
be performed as in section 3.2.1. But we have seen that
if such a signal is not approximately piecewise constant,
false discontinuities are created (see section 3.3.3). Now,
the function θ(s) is rarely piecewise constant, except for
polygonal contours. The solution is, as proposed in section
3.3.3, to smooth $\theta'(s) = \frac{d\theta(s)}{ds}$. Thus the Adaptive
Filter applied to this signal will smooth it while preserving
its discontinuities, which are identical to curvature
discontinuities.
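
A minimal sketch of how the tangent signal and its derivative might be computed from a sampled contour is given below. The function name, the use of central differences, and the use of `arctan2` with unwrapping (to resolve the quadrant and remove 2π jumps) are assumptions made for illustration; the text does not specify a discretization.

```python
import numpy as np

def tangent_angle_and_derivative(x, y):
    """Return theta(s), theta'(s) per sample, and an arc-length curvature.

    x, y are 1-D arrays sampling the contour in order. The parameter s is
    taken to be the sample index (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = np.gradient(x)
    dy = np.gradient(y)
    # arctan2 resolves the quadrant; unwrap removes artificial 2*pi jumps.
    theta = np.unwrap(np.arctan2(dy, dx))
    theta_prime = np.gradient(theta)       # derivative w.r.t. the sample index
    speed = np.hypot(dx, dy)                # arc length covered per sample
    kappa = theta_prime / speed             # derivative w.r.t. arc length
    return theta, theta_prime, kappa

# Usage: a circle of radius 2 has constant curvature 1/2.
t = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
theta, theta_prime, kappa = tangent_angle_and_derivative(2.0 * np.cos(t), 2.0 * np.sin(t))
```

On a real digitized contour `theta_prime` is noisy, which is why the text smooths it before looking for discontinuities.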

Figure  8(a)  shows  the  bounding  contour  of  a  wrench  and 
figure 8(b) the corresponding 1-D signal θ'(s) which we
call  curvature  (the  1-D  signal  corresponding  to  the  actual 
curvature  is  very  similar).  As  one  can  see,  there  are  fluc¬ 
tuations  of  the  curvature  along  the  curve  which  are  not 
meaningful  (the  1-D  signal  starts  at  the  far  left  point  of 
the  planar  curve  and  follows  the  contour  clockwise).  The 

result  of  the  application  of  the  Adaptive  Filter  to  this  sig¬ 
nal  can  be  seen  in  figure  10(a)  after  100  iterations.  Curva¬ 
ture  discontinuities  are  enhanced  while  the  signal  has  been 
smoothed  between  them.  It  is  then  possible  to  extract 
all  curvature  discontinuities  (smooth  joins)  and  associate 
those  corresponding  to  the  same  event  (corners).  Another 
possibility  is  to  use  that  1-D  signal  to  encode  the  curve  into 
piecewise  constant  areas.  Here,  for  the  sake  of  demonstra¬ 
tion,  we  only  extract  the  extrema  of  the  curvature.  The 
diagram  of  figure  10(b)  shows  the  behavior  of  these  peaks 
of  curvature  as  the  iteration  progresses.  Again,  there  is  nei¬ 
ther  displacement  nor  merging  of  the  features.  Figure  10(c) 
shows  the  corresponding  points  superimposed  on  the  con¬ 
tour.  The  filled  circles  correspond  to  positive  extrema  of 
the curvature and the others to negative extrema. This
illustrates how well the corners are detected and localized
by using the Adaptive Filter. For comparison, we
have smoothed the signal of figure 8(b) by repeated averaging
with k = 0 which gives after 100 iterations the signal
of  figure  9(a).  This  signal  is  very  much  blurred  and  the 
diagram  of  figure  9(b)  shows  how  the  extrema  of  the  cur¬ 
vature  are  displaced  as  iteration  progresses,  some  of  them 
being  “lost”  after  a  certain  amount  of  smoothing  as  nearby 
features  have  merged.  Figure  9(c)  shows  the  corresponding 
points  on  the  curve. 
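
The contrast between adaptive smoothing and plain repeated averaging can be illustrated with a small 1-D sketch. The update rule is defined in an earlier section that is not reproduced here, so everything below is an assumption: the sketch uses a three-point weighted average with weights exp(-d²/(2k²)), where d is a central-difference estimate of the local derivative, and it models plain averaging by uniform weights (which this assumed formula yields as k grows without bound). The function name and boundary handling are hypothetical.

```python
import numpy as np

def adaptive_smooth_1d(signal, k=0.1, iterations=10):
    """Iterative weighted averaging that down-weights high-gradient samples.

    Assumed rule (not taken from the text): w = exp(-d^2 / (2 k^2)) with d the
    local derivative, then a normalized 3-point weighted mean at each sample.
    """
    s = np.asarray(signal, dtype=float).copy()
    for _ in range(iterations):
        d = np.gradient(s)
        w = np.exp(-(d * d) / (2.0 * k * k))      # k = inf gives w = 1 everywhere
        sw = np.pad(s * w, 1, mode="edge")        # replicate the end samples
        ww = np.pad(w, 1, mode="edge")
        num = sw[:-2] + sw[1:-1] + sw[2:]
        den = ww[:-2] + ww[1:-1] + ww[2:] + 1e-12  # guard against zero weights
        s = num / den
    return s

# Usage: a unit step with a small alternating perturbation.
n = np.arange(100)
noisy_step = np.where(n >= 50, 1.0, 0.0) + 0.02 * (-1.0) ** n
adaptive = adaptive_smooth_1d(noisy_step, k=0.1, iterations=10)
averaged = adaptive_smooth_1d(noisy_step, k=np.inf, iterations=10)
```

Under these assumptions the adaptive result keeps the step sharp while flattening the perturbation, whereas uniform averaging spreads the step over many samples, which mirrors the displacement and merging of extrema described for figure 9.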


Figure 8: Curvature along the bounding contour of a wrench. (a) Bounding Contour of a wrench. (b) Curvature along the 2-D curve.

Figure 9: Corner Extraction after Gaussian Smoothing of the Curvature. (a) Gaussian Smoothing of the Curvature. (b) Extrema of Curvature at each iteration (Scale-Space Representation). (c) Corner Extraction.

Figure 10: Corner Extraction after Adaptive Smoothing of the Curvature. (b) Extrema of Curvature at each iteration. (c) Corner Extraction.

6 Conclusion


We have presented an Adaptive Filter which in its basic
formulation provides a new formalism for the Anisotropic
Diffusion developed by Perona and Malik. We believe
that this method is simpler and gives more stable results.
Furthermore, we have shown that this framework can be
extended easily to the preservation of higher order
discontinuities. Impressive results on intensity images, range images
and also planar curves have been provided. Nevertheless,
we think that we can improve the robustness of our Adaptive
Smoothing method, in particular for 2-D signals. In
this case, the estimate of d can be more complex as more
information such as connectivity is present in images. Also,
it is important that d does not depend on the scale. We are
currently investigating these ideas in order to enhance the
already good performance of our Adaptive Filter, which has
proven to be a powerful tool for the extraction of meaningful
features.
Abstract. This paper shows that a correspondence between
three points on a rigid solid object and three points
in  a  two-dimensional  image  is  sufficient  to  align  the  object 
with  the  image.  The  method  uses  a  “weak  perspective” 
viewing  model,  where  true  perspective  is  approximated  by 
orthographic  projection  plus  scale.  We  show  that  the  align¬ 
ment  transformation  exists,  and  is  unique  up  to  a  reflec¬ 
tion,  for  any  non-colinear  triple  of  object  and  image  points. 
The  transformation  can  also  be  computed  from  edge  frag¬ 
ments,  where  the  endpoints  are  unknown.  For  planar  ob¬ 
jects,  the  transformation  is  equivalent  to  a  two-dimensional 
affine  transform  from  the  object  to  the  image.  We  have  im¬ 
plemented  a  recognition  system  that  uses  one  or  two  pairs 
of  model  and  image  features  to  hypothesize  potential  align¬ 
ments  of  object  models  with  an  image.  Each  hypothesis 
is  verified  by  projecting  the  model  features  into  the  image. 
Examples  using  points  and  edges  extracted  from  grey-level 
images  are  shown. 


1  Introduction 

Object  recognition  is  generally  formulated  as  the  problem  of 
matching  one  or  more  object  models  to  an  image  containing 
zero  or  more  instances  of  these  objects.  The  matching  pro¬ 
cess  finds  instances  of  the  models  in  the  image,  and  for  each 
instance  determines  a  transformation  mapping  the  model 
onto  the  instance.  This  transformation  generally  specifies 
the  position  and  orientation  of  the  object  with  respect  to 
the  viewer  (the  object  pose),  although  for  certain  tasks  it 
does  not. 

Support was provided in part by the Advanced Research Projects
Agency of the Department of Defense under Army contract
DACA76-85-C-0010, in part by the Office of Naval Research
University Research Initiative Program under Office of Naval
Research contract N00014-86-K-0685, and in part by the Advanced
Research Projects Agency of the Department of Defense
under Office of Naval Research Contract N00014-85-K-0124.


A number of systems have been developed within this
model based recognition framework (for a recent review see [1]).
These  systems  generally  address  tasks  where  the  imaging 
process  does  not  involve  projection  from  the  world  into  the 
image;  for  instance,  tasks  where  the  objects  are  in  a  known 
plane  (2D  recognition),  or  tasks  where  the  “image”  is  three- 
dimensional,  specifying  a  distance  to  the  viewer  at  each 
pixel  (3D  from  3D  recognition).  This  restriction  eliminates 
the  problems  of  identifying  image  features  under  projection, 
and  recovering  the  three-dimensional  position  and  orienta¬ 
tion  of  an  object  from  a  two-dimensional  view. 

In  this  paper,  we  address  the  problem  of  recognizing 
rigid  solid  objects  with  unknown  position  and  orientation 
in  three-space,  from  a  single  two-dimensional  image.  The 
approach  is  to  determine  possible  alignments  of  a  model 
with  an  image,  and  then  verify  each  hypothesized  position 
and  orientation  by  transforming  the  model  into  image  coor¬ 
dinates.  The  method  is  an  extension  of  our  earlier  work  on 
aligning  objects  with  images  [8].  There  are  two  significant 
advances  reported  here.  First,  we  show  that  the  alignment 
transformation  exists  and  is  unique  (up  to  a  reflection)  for 
any  triple  of  non-colinear  model  points.  We  also  present  a 
closed  form  solution  for  the  alignment  transformation  that 
is  simple  to  compute.  Second,  we  have  implemented  an 
alignment  based  recognizer  for  solid  objects,  whereas  our 
earlier  implementation  was  restricted  to  planar  objects. 

1.1  The  Imaging  Model 

3D  from  2D  recognition  involves  projection  from  the  world 
into  the  image,  so  the  imaging  process  must  be  modeled 
in order to recover the position and orientation of an object
with respect to an image. Throughout this paper, we
assume a coordinate system with the origin on the image
plane, I, and with the z-axis normal to I.

Perspective projection is the most accurate model of
the imaging process. As shown in Figure 1, under this
model the center of projection, f, is defined to be a point
along the viewing axis, v, which is perpendicular to the
image plane, I. Each point in the world, p, is connected
to its image, p', in I, by a ray, P, that passes through the
focal point f. The focal length is the distance along v from
f to the image plane, I.

While  perspective  projection  is  an  accurate  model  of 
imaging,  it  is  relatively  complicated  to  recover  the  posi¬ 
tion  and  orientation  of  an  object  from  an  image  under  per¬ 
spective  projection.  In  general,  solving  for  the  transforma¬ 
tion  requires  six  model  points  and  six  corresponding  image 
points  [6].  Furthermore,  the  equations  are  relatively  un¬ 
stable,  so  in  practice  getting  a  good  estimate  of  the  trans¬ 
formation  requires  more  than  six  pairs  of  points,  and  the 
use  of  a  least  squares  or  other  error  minimization  procedure 
(such  as  in  [12]). 


Figure  1.  Perspective  projection. 


Under orthographic (or parallel) projection, each point, p,
in the world is connected to its image, p', in the image
plane, I, by a ray, P, which is perpendicular to I, as shown
in Figure 2. The major difference between perspective and
orthographic projection is that under perspective objects
that are farther away appear smaller. Thus, if a linear
scale factor is added to orthographic projection, a relatively
good approximation to perspective is obtained. The
approximation becomes poor when the object being imaged
is very deep with respect to the field of view, because a
single scale factor cannot be used for the entire object [14].
For instance, railroad tracks going off towards the horizon,
or an object viewed from very close up, are not well
approximated by this model. Orthographic projection plus scale
has been termed “weak perspective”, because it approximates
perspective well under most viewing conditions.

Figure 2. Orthographic projection.

Under weak perspective, all z-positions are equivalent.
Thus, the transformation specifying the position and
orientation of an object with respect to an image consists
of a two-dimensional translation (in x and y), a three-dimensional
rotation, and a scale factor that depends on
the distance to and size of the object. Such a transformation,
which preserves relative lengths, is called a similarity
transform.
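
The weak perspective model can be written down in a few lines. The sketch below is illustrative only: the function names and the argument layout (a 3 x 3 rotation matrix, a scalar scale, and a 2-D translation) are assumptions, not notation from the paper.

```python
import numpy as np

def weak_perspective(points, rotation, scale, translation):
    """Project 3-D points by rotation, orthographic projection, scale, and
    a 2-D translation (orthographic projection plus scale)."""
    pts = np.asarray(points, dtype=float)
    rotated = pts @ np.asarray(rotation, dtype=float).T
    # Dropping the z coordinate is the orthographic projection onto the image plane.
    return scale * rotated[:, :2] + np.asarray(translation, dtype=float)

def rotation_about_z(angle):
    """Rotation about the viewing (z) axis, used here only as a simple example."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Usage: two points at different depths, rotated 90 degrees about z.
image_pts = weak_perspective([[1.0, 0.0, 5.0], [0.0, 1.0, -3.0]],
                             rotation_about_z(np.pi / 2), 2.0, [10.0, 20.0])
```

Because the z coordinate is discarded after rotation, translating the object along the viewing axis leaves the image unchanged, which is the sense in which all z-positions are equivalent.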

1.2  Transforming  a  Model  to  an  Image 

The  problem  of  recovering  a  transformation  that  maps  an 
object  model  onto  an  image  has  been  a  major  focus  of  re¬ 
search  on  object  recognition.  Two  general  methods  have 
been  used  for  recovering  the  transformation.  The  first  method 
computes  global  properties  of  an  image  (such  as  moments 
of  inertia),  and  determines  the  transformation  from  these 
image  properties  and  corresponding  model  properties.  The 
problem  with  this  approach  is  that  the  properties  are  based 
on  the  entire  image  and  the  entire  model.  Therefore,  the 
method  is  not  applicable  to  an  image  that  contains  mul¬ 
tiple  objects,  or  partially  occluded  instances  of  an  object. 
The  second  method  uses  corresponding  local  model  and  im¬ 
age  features  to  recover  the  transformation.  These  features 
are  local,  so  occlusion  and  multiple  objects  are  less  of  a 
problem. A local feature, however, does not generally contain
enough information to solve for the transformation, so
groups of model and image features must be used. Therefore
this approach can involve substantial search among the
sets  of  possible  corresponding  model  and  image  features. 

The  idea  behind  the  alignment  method  presented  here 
is  to  identify  the  minimum  amount  of  information  needed 
to  solve  for  a  possible  position  and  orientation,  and  thereby 
to  minimize  the  amount  of  search  required  in  matching  lo¬ 
cal  model  and  image  features.  In  Section  2  we  show  that 
under  weak  perspective,  three  non-colinear  model  points 
and three corresponding image points are sufficient to align
a model with an image. In contrast, under perspective
projection it takes six corresponding model and image points
to  determine  the  position  and  orientation  of  a  solid  object. 

The weak perspective imaging model has been used in
a number of 3D from 2D recognition systems [2] [4] [11] [14].
These systems do not, however, solve the problem of computing
the transformation for positioning and orienting a
rigid solid object given an image. Instead, they either use a
heuristic approach to recovering the transformation [2], use
approximate solutions [13], store views of an object from
all possible angles in order to approximate the transformation
by table lookup [14], or restrict recognition to planar
objects [4] [11].


1.3  Recognition  by  Alignment 


In Section 3 we describe the ORA (Object Recognition by
Alignment, pronounced “ORA”) system that uses the alignment
transformation to hypothesize possible positions and
orientations  of  objects  in  an  image.  Rather  than  computing 
the  transformation  from  triples  of  model  and  image  points, 
the  recognizer  uses  one  or  two  features  that  define  more 
than  a  single  point.  These  features  are  computed  from  the 
intensity  edges  in  a  grey-level  image. 

There  are  two  classes  of  features,  based  on  the  amount 
of  position  and  orientation  information  that  a  feature  spec¬ 
ifies.  Class  I  features  specify  a  triple  of  points,  and  Class  II 
features  specify  an  oriented  point.  A  single  Class  I  feature 
in  an  image  and  a  corresponding  model  feature  is  enough 
to  compute  the  alignment  transformation,  while  a  pair  of 
corresponding Class II features is required. Class I features
are, for example, a vertex connected to two complete
edges, or an edge connected to two partial edges. Class II
features  are,  for  example,  a  vertex  connected  to  two  partial 
edges,  a  curved  arc,  or  two  nearby  edge  fragments. 

For  a  given  model,  each  Class  I  feature  is  matched 
with  each  image  feature  of  the  same  type,  and  the  resulting 
alignment  transformations  are  computed.  A  transforma¬ 
tion  is  scored  based  on  the  number  of  model  features  that 
are  brought  into  approximate  correspondence  with  an  im¬ 
age  feature.  Those  transformations  accounting  for  more 
than  half  of  a  model’s  features  are  kept  as  matches.  Once 
the  Class  I  features  are  exhausted,  each  pair  of  Class  II 
model  features  is  matched  with  each  pair  of  Class  II  image 
features  that  do  not  correspond  to  some  already  aligned 
model. 
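
The scoring rule described above can be sketched for point features. This is a simplified illustration: the function names, the use of point features only, and the pixel tolerance `tol` are hypothetical choices, since the text does not define what counts as "approximate correspondence".

```python
import numpy as np

def score_alignment(projected_model_pts, image_pts, tol=2.0):
    """Count model features that land within tol of some image feature.

    projected_model_pts: model features already transformed to image coordinates.
    tol: hypothetical distance tolerance, not a value from the text.
    """
    proj = np.asarray(projected_model_pts, dtype=float)
    img = np.asarray(image_pts, dtype=float)
    # Pairwise distances between every projected model point and every image point.
    dists = np.linalg.norm(proj[:, None, :] - img[None, :, :], axis=2)
    return int(np.sum(dists.min(axis=1) <= tol))

def is_match(projected_model_pts, image_pts, tol=2.0):
    """Keep a transformation if it accounts for more than half of the model."""
    score = score_alignment(projected_model_pts, image_pts, tol)
    return score > 0.5 * len(projected_model_pts)

# Usage with four model features and a few image features.
model = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
image_a = [[0.5, 0.0], [10.0, 1.0], [50.0, 50.0]]
image_b = image_a + [[0.0, 9.0]]
```

With `image_a` only two of the four model features are explained, so the hypothesis is rejected; adding one more nearby image feature in `image_b` tips it over the one-half threshold.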

The worst case running time of the matching process is
$O(m^3 i^2)$ for m model features and i image features, because
it may be necessary to consider all pairs of model features
against all pairs of image features, and scoring each match
involves transforming all the model features to image
coordinates. In practice, however, it is not necessary to consider
all pairs of features, because the transformations computed
from the Class I features have already accounted for many
of the Class II image features. Furthermore, to the extent
that any grouping or classification of features can be done,
the number of matches considered can be reduced by using
only certain pairs of features rather than taking all pairs.


The problem of accurately and efficiently determining
potential alignments of a model with an image is a central
problem in recognition. In this section we show that three
pairs of corresponding model and image points specify a
unique (up to a reflection) position and orientation of a
model with respect to an image. In the case of a planar
model, the alignment transform is equivalent to an affine
transform from the model plane to the image plane. Unlike
the affine transform, however, the alignment transform
applies to solid models.

A  closed  form  solution  for  computing  the  alignment 
transform  directly  from  three  pairs  of  corresponding  model 
and  image  points  is  also  presented.  The  method  also  applies 
to  two  pairs  of  oriented  model  and  image  points,  or  to  three 
pairs  of  model  and  image  edge  fragments  (without  knowing 
the  endpoints  of  the  edges).  Finally,  the  issue  of  recovering 
non-rigid  transformations  is  addressed. 

2.1  The  2D  Affine  Transform 

Some  systems  for  3D  from  2D  recognition  of  planar  objects 
use  a  two-dimensional  affine  transform  to  map  the  object 
plane  to  the  image  plane  [4]  [11].  A  two-dimensional  affine 
transform, A : x → x', can be represented as a non-singular
2 x 2 matrix, L, and a two-dimensional vector, b, such that
x' = Lx + b. An affine transformation preserves parallelism,
but captures scaling and shearing of a plane.

A two-dimensional affine transform can be used to map
a plane in space onto its image under orthographic projection
plus scale [11]. The converse, that a two-dimensional
affine transform corresponds to a position, orientation and
scale of a plane in space, is not, however, an established
result. Therefore, approaches that solve for an affine transform
from a model plane to an image [4] [11] have not established
that a solution must correspond to a possible three-dimensional
position and orientation of the model plane.

The proof in the next section, that three non-colinear
model and image points always uniquely define a three-dimensional
alignment transformation, makes use of the
two-dimensional affine transform from a model plane to an
image. It is shown that a two-dimensional affine transform
always corresponds to a unique three-dimensional similarity
transform (up to a reflective ambiguity). Thus an affine
transformation mapping a model plane to the image plane
does in fact always specify a three-dimensional position and
orientation.

A partial version of the existence and uniqueness of
the alignment transform has been shown by Kanade and
Kender [10]. They use an affine transformation between two
planar patches to derive a constraint on the relative orientation
of the patches. First they assume that if there is a two-dimensional
affine transformation x' = Lx + b, for x on $P_2$
and x' on $P_1$, then $P_1$ and $P_2$ are projections of planar patterns
$P_1'$ and $P_2'$ in three dimensions, which are related by a
similarity transformation (by a three-dimensional rotation,
translation and scale). Then they proceed to show that this
assumption constrains the relative three-dimensional orientations
of $P_1'$ and $P_2'$ to two possible configurations, which
are reflections of one another. A restricted form of this constraint
is used for recovering three-dimensional shape from
skewed symmetries in a two-dimensional image.

Kanade and Kender express their assumption as

$$L T_2 = T_1 \sigma R,$$

where $T_i$ rotates $P_i'$ about the x and y axes so that it is
parallel to the z = 0 plane, R is rotation about the z-axis
(the viewing axis), and σ is a linear scale factor. From
this equation, they derive two equations relating the slant
and tilt of $P_1'$ and $P_2'$. These equations are claimed to always
have two symmetric solutions. The proof, however,
relies on det(L) > 0, which is not the case when there is
a reflection involved in the transformation relating $P_1'$ and $P_2'$.
Furthermore, the uniqueness of the rotation about the z-axis
is not established. Thus they do not show that in
general a two-dimensional affine transform is always the
orthographic projection of a unique three-dimensional similarity
transform.

2.2  The  Alignment  Transformation  Exists  and  is 
Unique 

The  major  result  of  this  section  is  that  the  correspondence 
of  three  non-colinear  points  is  sufficient  to  determine  the 
position,  three-dimensional  orientation,  and  scale  of  a  rigid 
solid  object  with  respect  to  a  two-dimensional  image.  The 
result  is  shown  in  several  stages.  First  it  is  established  that 
an  affine  transformation  of  the  plane  is  uniquely  defined  by 
three  pairs  of  non-colinear  points.  Then  a  linear  transfor¬ 
mation  of  the  plane  is  shown  to  uniquely  define  a  similarity 
transformation  of  space,  specifying  the  orientation  of  one 
plane  with  respect  to  another  (up  to  a  reflection). 

These  two  results  are  then  combined  to  show  that  three 
pairs  of  non-colinear  points  define  the  position  and  orienta¬ 
tion  of  one  plane  with  respect  to  another  (up  to  reflection). 
For  a  rigid  object,  the  position  and  orientation  of  one  plane 
of the object defines the position and orientation of the entire
object. Thus, the method applies directly to rigid solid
objects.  In  the  case  of  planar  objects,  the  reflective  am¬ 
biguity  in  the  transformation  is  not  detectable.  For  solid 
objects,  the  ambiguity  may  be  resolved  using  a  point  not 
coplanar  with  the  three  alignment  points. 

Lemma 1. Given three non-colinear points $a_m$, $b_m$,
and $c_m$ in the plane, and three corresponding points $a_i$, $b_i$,
and $c_i$ in the plane, there exists a unique affine transformation,
$A(x) = Lx + b$, for any two-dimensional vector x,
where L is a linear transformation and b is a translation,
such that $A(a_m) = a_i$, $A(b_m) = b_i$, and $A(c_m) = c_i$.

This  fact  is  relatively  well  known  in  higher  geometry, 
and  its  proof  is  simple  (see  for  example  [5]  [9]). 
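
Lemma 1 is constructive: the six unknowns of L and b follow from a small linear system. The sketch below is one way to set it up, with a hypothetical function name and a hypothetical colinearity tolerance; it is not the closed form solution presented in the paper.

```python
import numpy as np

def affine_from_three_points(model_pts, image_pts, tol=1e-12):
    """Solve A(x) = L x + b from three point correspondences in the plane.

    model_pts, image_pts: arrays of shape (3, 2). Raises ValueError if the
    model points are (numerically) colinear; tol is an illustrative threshold.
    """
    m = np.asarray(model_pts, dtype=float)
    im = np.asarray(image_pts, dtype=float)
    # Each row [x, y, 1] times the unknown 3 x 2 matrix gives one image point.
    system = np.hstack([m, np.ones((3, 1))])
    if abs(np.linalg.det(system)) < tol:
        raise ValueError("model points are colinear")
    solution = np.linalg.solve(system, im)
    L = solution[:2].T      # linear part
    b = solution[2]         # translation
    return L, b

# Usage: recover a known transformation from three correspondences.
L_true = np.array([[1.0, 2.0], [3.0, 4.0]])
b_true = np.array([5.0, 6.0])
model = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
L_est, b_est = affine_from_three_points(model, model @ L_true.T + b_true)
```

The determinant test is exactly the non-colinearity condition of the lemma: the system matrix is singular if and only if the three model points lie on a line.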

Definition 1. A transformation, T, is a similarity
transform over a vector space V when

$$\|Tv_1\| = \|Tv_2\| \iff \|v_1\| = \|v_2\|$$

$$Tv_1 \cdot Tv_2 = 0 \iff v_1 \cdot v_2 = 0$$

for any $v_1$, $v_2$ in V.

Theorem 1. Given a linear transformation of the
plane, L, there exists a unique (up to a reflection) similarity
transform of space, U, such that $Lv \cong Uv^*$ for any two-dimensional
vector v, where $v^* = (x, y, 0)$ for any $v = (x, y)$,
and $v \cong w$ iff $v = (x, y)$ and $w = (x, y, z)$.

The  structure  of  the  equivalence  between  L  and  U 
stated  in  the  theorem  is, 

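$$\begin{array}{ccc}
V^2 & \xrightarrow{\;L\;} & V^2 \\
\downarrow{\scriptstyle *} & & \uparrow{\scriptstyle \cong} \\
V^3 & \xrightarrow{\;U\;} & V^3
\end{array}$$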

where  V2  and  V3  are  two-  and  three-dimensional  vector 
spaces,  respectively.  The  geometrical  interpretation  of  U 
is  as  a  rotation  and  scale  of  two  basis  vectors  (defining  a 
plane)  so  that  their  image  under  orthographic  projection  is 
the  same  as  applying  L  to  the  basis  vectors,  as  shown  in 
Figure  3. 


Figure 3. The geometrical interpretation of U.


Proof.  Clearly  some  three-dimensional  transformation  U 
must  exist  that  satisfies  the  definition  of  the  theorem  (for 
instance  just  embed  L  in  the  upper-left  part  of  a  3  x  3 
matrix,  with  the  remaining  entries  all  zero).  What  must 
be  shown  is  that  there  is  always  a  U  that  is  a  similarity 
transformation,  and  that  this  U  is  unique  up  to  a  reflection. 
In order for U to be a similarity transform, it must satisfy
the two properties of Definition 1. We show that these
properties are equivalent to two equations in two unknowns,
and that these equations always have two solutions differing
in sign. Thus U always exists and is unique up to a
reflection corresponding to the sign ambiguity.
Let $e_1$ and $e_2$ be orthonormal vectors in the plane, with

$$e_1' = L e_1, \qquad e_2' = L e_2.$$

If $v_1 = U e_1^*$ and $v_2 = U e_2^*$, then by the definition of U we
have

$$v_1 = e_1' + c_1 z, \qquad v_2 = e_2' + c_2 z,$$

where $z = (0, 0, 1)$, and $c_1$ and $c_2$ are constants.

U is a similarity transformation iff

$$v_1 \cdot v_2 = 0, \qquad \|v_1\| = \|v_2\|,$$

because $e_1$ and $e_2$ are orthogonal and of the same length.
From $v_1 \cdot v_2 = 0$,

$$(e_1' + c_1 z) \cdot (e_2' + c_2 z) = 0,$$

$$e_1' \cdot e_2' + c_1 c_2 = 0,$$

and hence

$$c_1 c_2 = -e_1' \cdot e_2'.$$

The right side of this equality is an observable quantity,
because we know L and can apply it to $e_1$ and $e_2$ to obtain
$e_1'$ and $e_2'$. Call $-e_1' \cdot e_2'$ the constant $k_1$.

In order for

$$\|v_1\| = \|v_2\|$$

it must also be that

$$c_1^2 - c_2^2 = \|e'_2\|^2 - \|e'_1\|^2.$$

Again the right side of the equality is observable; call it $k_2$.
It remains to be shown that these two equations,

$$c_1 c_2 = k_1, \qquad c_1^2 - c_2^2 = k_2,$$

always have a solution that is unique up to a sign ambiguity.
Substituting

$$c_2 = \frac{k_1}{c_1}$$

into the latter equation and rearranging terms we obtain

$$c_1^4 - k_2 c_1^2 - k_1^2 = 0,$$

a quadratic in $c_1^2$. Substituting x for $c_1^2$ and using the
quadratic formula yields

$$x = \tfrac{1}{2}\left(k_2 \pm \sqrt{k_2^2 + 4k_1^2}\right).$$

A quadratic always has one or two solutions. We are
only interested in positive solutions for x, because $c_1 = \pm\sqrt{x}$.
There can only be one positive solution, as $4k_1^2 > 0$, and
thus the quantity inside the square root is $> k_2^2$. Hence
there are exactly two real solutions for $c_1 = \pm\sqrt{x}$. For the
positive and negative solutions to $c_1$ there are corresponding
solutions of opposite sign to $c_2$, because $c_1 c_2 = k_1$.


The equation for x does not have a solution when $c_1 = 0$,
because the substitution for $c_2$ is undefined. When $c_1 = 0$,
however, $c_2 = \pm\sqrt{-k_2}$, which always has two solutions
because $k_2 < 0$. Recall

$$\|e'_1\|^2 + c_1^2 = \|e'_2\|^2 + c_2^2;$$

therefore, because $c_1 = 0$,

$$\|e'_1\|^2 = \|e'_2\|^2 + c_2^2,$$

and hence

$$\|e'_1\|^2 > \|e'_2\|^2.$$


Thus  there  are  always  exactly  two  solutions  for  ci  and 
c2,  differing  in  sign.  These  equations  have  a  solution  iff 
the  similarity  transform,  U,  exists.  So  there  are  always 
exactly  two  solutions  for  U  which  differ  in  the  sign  of  the 
z  component  of  x',  where  x'  =  Ux,  and  hence  the  sign 
difference corresponds to a reflective ambiguity in U. ∎
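The two equations in the proof can be checked numerically. The following is a minimal sketch, not the author's implementation: the function name `solve_z_components`, the tolerance `1e-12`, and the example matrix are assumptions made only for illustration. It solves $c_1 c_2 = k_1$ and $c_1^2 - c_2^2 = k_2$ for a given L and prints, for each of the two sign choices, the dot product and the norms of $v_1$ and $v_2$.

```python
import math

def solve_z_components(L):
    """Solve c1*c2 = k1 and c1**2 - c2**2 = k2 for a 2x2 map L = ((a, b), (c, d)).

    Returns the two solutions (c1, c2) and (-c1, -c2).
    """
    (a, b), (c, d) = L
    e1 = (a, c)  # image of (1, 0) under L
    e2 = (b, d)  # image of (0, 1) under L
    k1 = -(e1[0] * e2[0] + e1[1] * e2[1])
    k2 = (e2[0] ** 2 + e2[1] ** 2) - (e1[0] ** 2 + e1[1] ** 2)
    # Non-negative root of x**2 - k2*x - k1**2 = 0, where x = c1**2.
    x = 0.5 * (k2 + math.sqrt(k2 * k2 + 4.0 * k1 * k1))
    if x > 1e-12:  # tolerance is an assumption of this sketch
        c1 = math.sqrt(x)
        c2 = k1 / c1
    else:
        # c1 = 0 case from the proof: c2**2 = -k2.
        c1 = 0.0
        c2 = math.sqrt(max(-k2, 0.0))
    return (c1, c2), (-c1, -c2)


L = ((1.0, 1.0), (0.0, 1.0))  # an arbitrary shear, chosen only for illustration
for c1, c2 in solve_z_components(L):
    v1 = (L[0][0], L[1][0], c1)
    v2 = (L[0][1], L[1][1], c2)
    dot = sum(p * q for p, q in zip(v1, v2))
    n1 = math.sqrt(sum(p * p for p in v1))
    n2 = math.sqrt(sum(p * p for p in v2))
    print(f"c1={c1:+.4f} c2={c2:+.4f} dot={dot:.1e} |v1|={n1:.4f} |v2|={n2:.4f}")
```

For both solutions the dot product should vanish and the two norms should agree up to rounding, which is the pair of conditions the proof requires of a similarity transformation.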

We  can  now  prove  Theorem  2,  the  major  result  of  this 
section. 

Theorem 2. Given three non-collinear points $a_m$, $b_m$,
and $c_m$ in the plane, and three corresponding points $a_i$,
$b_i$, and $c_i$ in the plane, there exists a unique similarity
transformation (up to a reflection), Q, such that $Q(a_m^*) \equiv a_i$,
$Q(b_m^*) \equiv b_i$, and $Q(c_m^*) \equiv c_i$, where $v^* = (x,y,0)$ for
any $v = (x,y)$, and $v \equiv w$ iff $v = (x,y)$ and $w = (x,y,z)$.
The transformation Q is a three-dimensional translation,
rotation, and a scale factor.

Proof. From Lemma 1 there is a unique affine transformation
such that $A(a_m) = a_i$, $A(b_m) = b_i$, and $A(c_m) = c_i$.
By the definition of an affine transformation, A
consists of two components, a translation vector b and a
linear transformation L. Because A is uniquely defined by
three pairs of points, we can choose one of the three point
pairs to define $b = a_m - a_i$. Then L is the linear transformation
such that $L b_m - b = b_i$ and $L c_m - b = c_i$. Using
$b_m - b_i$ or $c_m - c_i$ to define b is equivalent, because the
resulting vector and linear transformation must specify the
same affine transformation, A.

Given L, by Theorem 1 there is a unique (up to a reflection)
rotation and scale, U, such that $Uv^* \equiv Lv$ for all
two-dimensional vectors v. Combining b and U specifies
a unique similarity transformation Q consisting of translation,
rotation, and scale such that $Q(a_m^*) \equiv a_i$, $Q(b_m^*) \equiv b_i$,
and $Q(c_m^*) \equiv c_i$. ∎

Note that it is not always necessary to explicitly compute
the three-dimensional transformation Q in order to
transform  a  model  to  an  image.  In  the  case  that  the  model 
is  planar,  the  affine  transformation  A  is  sufficient  to  map 
points  from  the  model  plane  to  the  image  plane. 

2.3  Computing  the  Transformation 

The previous section established the existence and uniqueness
of the transformation Q mapping model to image points
(and the corresponding affine transformation A). This section
shows how to compute Q (and A) given three pairs
of points $(a_m, a_i)$, $(b_m, b_i)$, and $(c_m, c_i)$, where the image
points are in two-dimensional sensor coordinates and the
model points are in three-dimensional object coordinates.

Step 0. Rotate and translate the model so that $a'_m$ is at
the origin (0,0,0), and $b'_m$ and $c'_m$ are in the z = 0 plane.
Note that this operation can be performed offline for each
triple of model points.

Step 1. Translate the coordinates of the image points $b_i$
and $c_i$ in terms of the new origin $a_i$, calling the resulting
points $b'_i$ and $c'_i$.

Step 2. Solve for the linear transformation

$$L = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

given by the two pairs of equations in two unknowns

$$a x_b + b y_b = x'_b, \qquad a x_c + b y_c = x'_c,$$

and

$$c x_b + d y_b = y'_b, \qquad c x_c + d y_c = y'_c,$$

where $(x_b, y_b) = b'_m$, $(x_c, y_c) = c'_m$, $(x'_b, y'_b) = b'_i$, and
$(x'_c, y'_c) = c'_i$. These first two steps yield the affine
transformation, A.

Step 3. Apply L to the orthogonal axes (1,0) and (0,1)
to obtain $e'_1 = (a,c)$ and $e'_2 = (b,d)$.

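Step 4. Solve for $c_1$ and $c_2$, the z components of $v_1$ and
$v_2$, using

$$c_1 = \pm\sqrt{\tfrac{1}{2}\left(w + \sqrt{w^2 + 4(e'_1 \cdot e'_2)^2}\right)}, \qquad c_2 = -\frac{e'_1 \cdot e'_2}{c_1},$$

where $w = \|e'_2\|^2 - \|e'_1\|^2$. The expression for $c_2$ follows from
$c_1 c_2 = -e'_1 \cdot e'_2$ in the proof of Theorem 1.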


Step 5. Define

$$v_1 = e'_1 + c_1 z, \qquad v_2 = e'_2 + c_2 z, \qquad v_3 = v_1 \times v_2.$$

The rotation matrix, R, is defined by the three equations

$$u_1 = R(1,0,0)^T, \qquad u_2 = R(0,1,0)^T, \qquad u_3 = R(0,0,1)^T,$$

where $u_1$ is the unit vector in the direction $v_1$, and similarly
for $u_2$ and $u_3$.

If rotations in terms of Euler angles are needed, then
the angles $\alpha$, $\beta$, and $\gamma$ can also be defined,

$$\alpha = \arctan\frac{y_3}{x_3}, \qquad \beta = \arccos\frac{z_3}{\|v_3\|},$$

where $v_3 = (x_3, y_3, z_3)$, and

$$\gamma = \angle v_2 q,$$

where q is the intersection of the $v_1$, $v_2$ plane with the
x-y plane, which can be computed by rotating the y-axis,
(0,1), about the origin by $\alpha$. These Euler angles define the
rotation matrix,

$$R = \begin{pmatrix}
\cos\alpha\cos\beta\cos\gamma - \sin\alpha\sin\gamma & -(\cos\alpha\cos\beta\sin\gamma + \sin\alpha\cos\gamma) & \cos\alpha\sin\beta \\
\sin\alpha\cos\beta\cos\gamma + \cos\alpha\sin\gamma & -\sin\alpha\cos\beta\sin\gamma + \cos\alpha\cos\gamma & \sin\alpha\sin\beta \\
-\sin\beta\cos\gamma & \sin\beta\sin\gamma & \cos\beta
\end{pmatrix}.$$

Step 6. Finally, the scale factor is $s = \|v_1\|$.

Note  that  the  other  solution,  where  ci  and  c2  are  both 
negated,  corresponds  to  reflecting  the  model  about  the  plane 
defined  by  the  three  model  points. 

This  method  of  computing  the  alignment  transforma¬ 
tion  is  relatively  fast,  because  it  involves  a  small  number  of 
terms,  none  of  which  are  more  than  quadratic.  My  imple¬ 
mentation  on  a  Symbolics  3650  takes  about  8  milliseconds 
to  compute  the  transformation. 
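Steps 1 through 6 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Symbolics implementation mentioned above: the names `align`, `project`, `cross`, and `unit`, the `sign` parameter, the collinearity tolerance, and the sample coordinates are all hypothetical. The model points are assumed to have already been moved by Step 0, so that $a'_m$ is the origin and $b'_m$, $c'_m$ lie in the z = 0 plane.

```python
import math

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(v):
    n = math.sqrt(sum(p * p for p in v))
    return tuple(p / n for p in v)

def align(bm, cm, ai, bi, ci, sign=1.0):
    """Steps 1-6 for model points already moved by Step 0.

    bm, cm: (x, y) of b'_m and c'_m in the z = 0 plane (a'_m is the origin).
    ai, bi, ci: image points (x, y).
    sign: +1.0 or -1.0, selecting one of the two reflected solutions.
    Returns (R, s), with R given as its three columns (u1, u2, u3).
    """
    # Step 1: image points relative to a_i.
    xb_i, yb_i = bi[0] - ai[0], bi[1] - ai[1]
    xc_i, yc_i = ci[0] - ai[0], ci[1] - ai[1]
    (xb, yb), (xc, yc) = bm, cm
    # Step 2: solve the two 2x2 systems for L = ((a, b), (c, d)) by Cramer's rule.
    det = xb * yc - xc * yb
    if abs(det) < 1e-12:
        raise ValueError("model points are collinear")
    a = (xb_i * yc - xc_i * yb) / det
    b = (xb * xc_i - xc * xb_i) / det
    c = (yb_i * yc - yc_i * yb) / det
    d = (xb * yc_i - xc * yb_i) / det
    # Step 3: images of the axes (1, 0) and (0, 1).
    e1, e2 = (a, c), (b, d)
    # Step 4: z components c1 and c2.
    dot = e1[0] * e2[0] + e1[1] * e2[1]
    w = (e2[0] ** 2 + e2[1] ** 2) - (e1[0] ** 2 + e1[1] ** 2)
    x = 0.5 * (w + math.sqrt(w * w + 4.0 * dot * dot))
    if x > 1e-12:
        c1 = sign * math.sqrt(x)
        c2 = -dot / c1
    else:
        c1, c2 = 0.0, sign * math.sqrt(max(-w, 0.0))
    # Step 5: v1, v2, v3 and the columns of the rotation matrix.
    v1 = (e1[0], e1[1], c1)
    v2 = (e2[0], e2[1], c2)
    v3 = cross(v1, v2)
    R = (unit(v1), unit(v2), unit(v3))
    # Step 6: scale factor.
    s = math.sqrt(sum(p * p for p in v1))
    return R, s

def project(R, s, ai, p):
    """Apply s*R to the model point (x, y, 0), drop z, and translate by a_i."""
    u1, u2, _ = R
    return (s * (u1[0] * p[0] + u2[0] * p[1]) + ai[0],
            s * (u1[1] * p[0] + u2[1] * p[1]) + ai[1])


bm, cm = (2.0, 0.0), (0.5, 1.5)                          # illustrative model points
ai, bi, ci = (10.0, 20.0), (12.0, 21.0), (9.0, 22.5)     # illustrative image points
R, s = align(bm, cm, ai, bi, ci)
print(project(R, s, ai, bm), project(R, s, ai, cm))
```

Projecting the model points with the recovered rotation and scale should reproduce the image points $b_i$ and $c_i$ up to rounding. Calling `align` with `sign=-1.0` gives the reflected solution, which projects to the same image points. The sketch does not handle a degenerate L (for example, all three image points coincident).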


3  Recognition  Using  Alignment 

We  have  implemented  a  3D  from  2D  recognition  system 
called  ORA  (for  Object  Recognition  by  Alignment,  pro¬ 
nounced  “aura”).  The  ORA  system  uses  one  or  two  pairs  of 
model  and  image  features  to  compute  possible  alignments 
of  solid  objects  with  a  two  dimensional  image.  The  sys¬ 
tem  first  extracts  features  from  the  intensity  edges  in  an 
image.  These  features  are  then  matched  against  model  fea¬ 
tures,  and  used  to  solve  for  possible  alignments  of  a  model 
with  the  image.  Each  alignment  is  scored  by  projecting  the 
aligned  model  features  into  the  image,  and  counting  the 
number  of  projected  features  for  which  there  is  a  nearby 
image  feature  of  the  same  type  and  similar  orientation. 
Alignments  that  account  for  a  high  percentage  of  a  model’s 
features  are  kept  as  correct  matches. 

ORA  uses  three  types  of  primitive  features:  straight 
edges,  corners,  and  arcs.  Groups  of  connected  or  nearly- 
connected  primitive  features  are  combined  to  form  align¬ 
ment  features.  Each  alignment  feature  defines  either  a 
triple  of  points  or  an  oriented  point.  Corresponding  align¬ 
ment  features  in  a  model  and  an  image  are  used  alone  or  in 
pairs  to  recover  the  three-dimensional  transformation  from 
a  model  to  an  image. 

A grey-level image is first processed using a Canny operator
[3] to extract intensity edges. Edge pixels are chained
into edge contours by following unambiguous 8-way neighbors.
At any ambiguity point, new chains are started. Each
chain is segmented at inflection points and at the ends of
low-curvature regions, as described in [8]. Segments are
categorized into the primitive features: straight edge, corner,
or arc. A segment that can be well approximated by
a single straight line is classified as a straight edge. A segment
that has a local high curvature portion is classified as
a corner. The remaining segments are classified as arcs.

The  Class  I  features,  those  defining  three  points,  are 
shown  in  Figure  4.  The  features  are:  i)  an  edge  with  two 
partial  edges,  ii)  a  corner  with  two  complete  edges,  and 
iii)  an  arc  with  two  partial  arcs  (or  edges).  The  Class  II 
features,  those  defining  an  oriented  point,  are  shown  in  Fig¬ 
ure  5.  The  features  are:  i)  a  corner  with  two  partial  edges, 
and  ii)  an  inflection  joining  two  arcs. 


Figure  4.  Class  I  features  define  three  points.  A  single 
matching  feature  is  sufficient  for  alignment. 
Figure 5. Class II features define an oriented point. A
pair of matching features is sufficient for alignment.

A  model  consists  of  the  three-dimensional  locations  of  the 
edges  of  its  surfaces.  A  contour  is  represented  by  a  chain 
of  three-dimensional  edge-pixel  locations.  Properties  of  a 
surface  itself  are  not  used  (e.g.,  the  curvature  of  the  sur¬ 
face),  only  the  bounding  contours.  Only  the  set  of  surfaces 
visible  from  a  given  viewpoint  are  used  to  form  a  model. 
Thus  for  some  objects,  several  models  from  different  views 
are  needed.  For  objects  whose  shape  is  defined  mainly  by 
surface  characteristics,  rather  than  by  the  edges  of  surfaces, 
this  modeling  technique  is  insufficient.  For  most  manmade 
objects,  however,  the  models  appear  to  be  adequate. 

3.1  The  Matching  Algorithm 

After the alignment features are extracted from an image,
all pairs of Class I model and image features are used to
solve for potential alignments, and each alignment is verified
using all the model and image features. If an alignment
maps sufficiently many model features to image features,
then it is accepted as a correct match of an object model
to an instance in the image. Each image feature that is
accounted for by an accepted alignment is removed from
further consideration by the matcher. The exact algorithm
is given below, where the variable ALIGNMENTS is used to
accumulate the accepted alignments, IMAT is a boolean table
indicating the image features that have been matched
to some model feature, and MCLASS1 and ICLASS1 are the
Class I model and image features. The procedure ALIGN1
computes the possible alignments from a pair of Class I
model and image features, VERIFY finds all matching model
and image feature pairs given an alignment, GOOD-ENOUGH
checks that there are sufficiently many feature pairs, and
PAIR-IFEAT is a selector that returns the image feature of
a feature pair.

for MFEAT in MCLASS1
 do for IFEAT in ICLASS1
     do if not IMAT(IFEAT)
         then for ALIGNMENT in ALIGN1(MFEAT, IFEAT)
               do MATCH ← VERIFY(ALIGNMENT)
                  if GOOD-ENOUGH(MATCH)
                   then put ALIGNMENT on ALIGNMENTS
                        for PAIR in MATCH
                         do IFEAT ← PAIR-IFEAT(PAIR)
                            IMAT(IFEAT) ← true
                         od
               od
     od
 od

Once  the  Class  I  model  and  image  features  have  been 
exhausted,  all  pairs  of  Class  II  model  and  image  features 
that  have  not  already  been  accounted  for  (are  not  marked 
in  IMAT)  are  used  to  solve  for  potential  alignments. 
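The Class I loop above might be rendered in Python roughly as follows. This is a sketch, not the ORA code: the procedures ALIGN1, VERIFY, and GOOD-ENOUGH are passed in as callables because their internals are not specified here, features are assumed to be hashable values, and the toy stand-ins at the bottom exist only so the example runs.

```python
def match_class1(mclass1, iclass1, align1, verify, good_enough):
    """Accumulate accepted alignments over all Class I model/image feature pairs.

    align1(mfeat, ifeat) -> iterable of candidate alignments
    verify(alignment)    -> list of (model_feature, image_feature) pairs
    good_enough(match)   -> bool
    """
    alignments = []
    imat = set()  # image features already accounted for (the IMAT table)
    for mfeat in mclass1:
        for ifeat in iclass1:
            if ifeat in imat:
                continue
            for alignment in align1(mfeat, ifeat):
                match = verify(alignment)
                if good_enough(match):
                    alignments.append(alignment)
                    # PAIR-IFEAT: take the image feature of each matched pair.
                    for _, matched_ifeat in match:
                        imat.add(matched_ifeat)
    return alignments, imat


# Toy stand-ins (assumptions for illustration): features are integers, and an
# "alignment" is correct only when the model and image feature are equal.
toy_align1 = lambda m, i: [(m, i)]
toy_verify = lambda alignment: [alignment] if alignment[0] == alignment[1] else []
toy_good_enough = lambda match: len(match) >= 1

accepted, used = match_class1([1, 2], [1, 2, 3], toy_align1, toy_verify, toy_good_enough)
print(accepted, sorted(used))
```

A set is used for IMAT instead of a boolean table; the Class II pass would have the same shape, iterating over pairs of features and skipping any image feature already in the set.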

The alignment computation specifies two possible transformations
that differ by a reflection of the model about
the plane defined by the three model points. A further
ambiguity may be introduced by not knowing the exact
correspondence between the points in the model feature and
the points in the image feature. For instance, for a feature
that defines two points and two orientations, there are two
possible correspondences between the model feature and the
image feature. Sometimes it is possible to discard one of
these two possible correspondences based on the kind of
partial edges (arc versus straight) at each end. Thus for a
pair of Class I model and image features (procedure ALIGN1)
there are either two or four possible transformations from
the model to the image, whereas for two pairs of Class II
model and image features (procedure ALIGN2) there are two
possible alignments.

To verify an alignment (procedure VERIFY), each model
feature (both Class I and Class II) is transformed into image
coordinates to determine if there is a corresponding
image feature. The image features are stored in a table
according to position and orientation, so only a small number
of image features are considered for each model feature.
The verification procedure requires a good correspondence
between a model feature and an image feature, not just a
high correlation of the model with the image. For example,
an image edge contour that continues beyond the end of a
model edge contour is not a good match, whereas one that
coterminates is. Similarly, if an image edge crosses a model
edge, then that model edge is not well matched. This is
important in complex images of the kind the ORA system
has been tested on [9].

In the current implementation, a model feature has a
match if there is a corresponding image feature of the same
type, approximate position (within 10 pixels), and approximate
orientation (within π/10 radians). A given alignment
is verified if more than a certain percentage (currently half)
of the transformed model features have a corresponding image
feature.
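The thresholds just quoted can be written down as a small acceptance test. The sketch below makes several assumptions that are not in the text: a feature is represented as a tuple `(type, x, y, orientation)`, the function names are hypothetical, and image features are scanned linearly rather than looked up in a position/orientation table. It also ignores the stricter conditions on coterminating and crossing edges described earlier.

```python
import math

POSITION_TOL = 10.0              # pixels
ORIENTATION_TOL = math.pi / 10   # radians
ACCEPT_FRACTION = 0.5            # "currently half"

def angle_difference(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = (a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)

def features_match(model_feat, image_feat):
    """Same type, nearby position, and similar orientation."""
    mtype, mx, my, mtheta = model_feat
    itype, ix, iy, itheta = image_feat
    return (mtype == itype
            and math.hypot(mx - ix, my - iy) <= POSITION_TOL
            and angle_difference(mtheta, itheta) <= ORIENTATION_TOL)

def is_verified(projected_model_feats, image_feats):
    """True if more than half of the projected model features have a match."""
    matched = sum(
        1 for m in projected_model_feats
        if any(features_match(m, i) for i in image_feats)
    )
    return matched > ACCEPT_FRACTION * len(projected_model_feats)


# Illustrative data only: model features already transformed into image coordinates.
model = [("corner", 0.0, 0.0, 0.0), ("edge", 50.0, 50.0, 1.0), ("edge", 100.0, 100.0, 2.0)]
image = [("corner", 3.0, 4.0, 0.1), ("edge", 52.0, 50.0, 1.2)]
print(is_verified(model, image))
```

Whether the position and orientation bounds are inclusive is not stated in the text; the sketch treats them as inclusive and the acceptance fraction as strict.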

When  an  alignment  matches  a  model  to  an  image,  each 
image  feature  that  is  matched  to  a  model  feature  is  taken  to 
be  accounted  for  by  that  model,  and  is  removed  from  fur¬ 
ther  consideration  by  the  matcher  (by  marking  it  in  IMAT). 
This  can  greatly  reduce  the  set  of  matches  considered,  but 
at  the  cost  of  possibly  missing  a  match  in  the  rather  un¬ 
likely  event  that  the  features  from  a  given  instance  of  an 
object  in  an  image  are  incorrectly  incorporated  into  verified 
alignments  of  other  objects. 

The worst case running time of the matching process
is $O(m^3 i^2)$ for m model features and i image features. This
is because all the features may be of Class II, and it may be
necessary to consider matching each pair of model features
against each pair of image features, which is $m^2 i^2$. Scoring
each alignment then involves transforming each model feature
to image coordinates, which is another m operations.
In practice, however, it is not necessary to consider all pairs
of features, because a given alignment will eliminate other
image features from consideration.

Each alignment computation is independent of all the
others, so the alignments can all be computed in parallel
on a massively parallel machine such as the Connection
Machine [7]. This computation is constant time as long as
the number of alignments to be computed does not exceed
the number of processors. All the alignments can then be
verified in parallel, which will take $O(m)$ time, assuming
there is one processor per alignment computation, because
each of the m model features must be considered for each
alignment.

3.2  Examples 

The  recognizer  has  been  tested  using  some  images  of  sim¬ 
ple  polyhedral  objects  in  relatively  complex  scenes,  taken 
under  normal  lighting  conditions  in  offices  or  laboratories. 
While  the  representation  is  not  constrained  to  polyhedra, 
it  is  currently  difficult  to  enter  the  three-dimensional  coor¬ 
dinates  of  non-polyhedral  objects  in  order  to  form  models. 


Figure 6 and Figure 7 show some examples of the
recognizer. Part i) of each figure shows the grey-level image,
part ii) shows the Canny edges, part iii) shows the primitive
features (straight edges are in bold and corners are marked
by dots), and part iv) shows all the recognized instances of
the models in the image.

There  are  several  hundred  alignment  features  in  each 
of  these  images,  so  many  thousand  possible  positions  and 
orientations  of  each  object  model  were  considered.  All  the 
hypotheses  that  survived  verification  are  shown  in  part  iv) 
of  the  examples.  The  matching  time  (after  feature  extrac¬ 
tion)  is  2-3  minutes  per  image  on  a  Symbolics  3650.  About 
one  third  of  the  time  is  spent  computing  alignments,  and 
the  remainder  is  taken  up  transforming  model  features  and 
looking  up  corresponding  image  features  for  verification. 

As  can  be  seen  in  the  examples,  the  determination  of 
position  and  orientation  is  quite  good,  but  does  not  bring 
the  entire  model  into  exact  correspondence  with  the  im¬ 
age.  For  tasks  where  a  more  exact  alignment  is  required, 
a  least  squares  or  other  error  minimization  procedure  may 
be  used  to  improve  the  estimate  of  position  and  orienta¬ 
tion.  This  refined  pose  estimate  can  be  computed  using 
perspective  projection,  rather  than  weak  perspective,  for 
even  higher  accuracy.  The  idea  is  to  use  alignment  under 
weak  perspective  to  find  nearly  correct  positions,  because  it 
is relatively computationally efficient, and then to perform
less  efficient  but  more  accurate  positioning  only  for  verified 
alignments. 


4  Summary 

We  have  shown  that  under  the  weak  perspective  imaging 
model,  three  corresponding  model  and  image  points  are 
necessary  and  sufficient  to  align  a  rigid  solid  object  with 
an  image  (up  to  a  reflective  ambiguity).  We  give  a  closed 
form  solution  for  computing  the  alignment  from  any  triple 
of  model  and  image  points.  In  contrast,  other  3D  from 
2D  recognition  systems  generally  use  approximate  solution 
methods  rather  than  solving  directly  for  all  possible  trans¬ 
formations.  Furthermore,  systems  that  use  a  perspective 
projection  model  of  imaging  require  at  least  six  pairs  of 
corresponding  points  to  compute  the  transformation. 

The  ORA  system  uses  one  or  two  corresponding  model 
and  image  features  to  align  a  model  with  an  image,  and  then 
verifies  the  alignment  by  transforming  the  model  features 
into  image  coordinates.  The  features  are  based  on  a  small 
number of local edge properties, making them relatively
reliable under sensor error and partial occlusion. The fact
that we have identified the minimum amount of information
necessary to recover position and orientation makes it
possible to specify features that are both recognizable in
real images and sufficient to align a model with an image.

Two classes of features are used: those that define three
points, and those that define an oriented point. A single
three-point feature, or a pair of oriented-point features, is
sufficient to hypothesize a potential position and orientation
of a model with respect to an image. Thus there are at
most $O(m^2 i^2)$ possible alignments to consider for m model
features and i image features. Verifying each alignment
then takes $O(m)$ time to transform each model feature to
image coordinates. We have shown some examples of recognizing
objects in relatively complex indoor scenes, under
normal lighting conditions.
Abstract

A  major  goal  of  computer  vision  is  object  recognition,  which 
involves  matching  of  images  of  an  object,  obtained  from  differ¬ 
ent,  unknown  points  of  view.  Since  there  are  infinitely  many 
points  of  view,  one  is  faced  with  the  problem  of  a  search  in 
a  multidimensional  parameter  space.  A  related  problem  is  the 
stereo  reconstruction  of  3-D  surfaces  from  multiple  2-D  images. 
We  propose  to  solve  these  fundamental  problems  by  using  ge¬ 
ometrical  properties  of  the  visible  shape  that  are  invariant  to 
a  change  in  the  point  of  view.  To  obtain  such  invariants,  we 
start  from  classical  theories  for  differential  and  algebraic  invari¬ 
ants  not  previously  used  in  image  understanding.  As  they  stand, 
these  theories  are  not  directly  applicable  to  vision.  We  suggest 
extensions and adaptation of these methods to the needs of
machine vision. We study general projective transformations, which
include both perspective and orthographic projections as special
cases.


1.  INTRODUCTION

A  major  goal  of  any  vision  system  (natural  or  machine)  is 
to  identify  objects  that  are  visible  in  the  scene,  regardless  of 
their  position,  orientation  or  scale  relative  to  the  viewer  (scale 
being  dependent  on  distance).  The  shape  of  an  object  is  in¬ 
dependent  of  the  point  of  view  from  which  it  is  seen,  namely 
the  viewer-centered  coordinates,  but  its  visible  image  is  highly 
viewer-dependent.  This  raises  the  problem  of  how  to  separate 
the  intrinsic  characteristics  of  the  object  from  the  incidental 
characteristics  of  the  point  of  view.  When  building  a  data  base 
of  objects,  for  instance,  we  would  like  to  be  able  to  store  only 
the  intrinsic  properties  of  the  objects.  These  intrinsic  properties 
will  be  invariants,  in  the  sense  that  they  can  be  calculated  from 
measurements  taken  in  any  egocentric  coordinate  system,  but 
will  be  independent  of  the  particular  system.  A  simple  exam¬ 
ple  is  the  length  of  a  rod,  which  is  invariant  under  rotation.  In 
a  simple  world  consisting  of  rods  that  lie  in  a  plane,  and  with 
images  taken  orthographically,  one  can  identify  a  particular  rod 
by  measuring  its  length  on  the  image  and  comparing  it  to  some 
data base of rods’ lengths. The rod’s orientation is irrelevant
and can be ignored.

In  current  methods  of  building  a  data  base  of  shapes  [Bal¬ 
lard  and  Brown,  1984]  an  image  is  taken  from  a  particular  point 
of  view.  Then,  to  match  it  to  objects  seen  from  a  different, 
unknown  view,  a  search  has  to  be  made  in  the  space  of  the 
parameters  determining  the  relative  point  of  view,  such  as  ro¬ 
tation  angles,  scale,  and  perspective  parameters,  and  a  match 
criterion has to be examined for each point in this parameter
space.  This  is  very  time  consuming.  Using  invariants  can  elimi¬ 
nate  this  search,  considerably  reducing  the  amount  of  time  and 
memory  needed  to  recognize  the  object. 


Other  methods  attempting  to  reduce  the  search  size  have 
been  employed,  for  instance  using  Fourier  transforms  of  closed 
curves  or  surfaces,  or  Hough  transforms  that  match  many  ob¬ 
jects  simultaneously.  However,  besides  being  restricted  in  their 
domain  of  applicability,  none  of  these  methods  get  rid  of  all  the 
undesirable  parameters;  one  still  has  to  search  in  a  multidimen¬ 
sional  space. 

Starting  on  the  most  fundamental  level,  the  issue  arises  of 
what  kind  of  invariants  are  useful  in  vision. 

A  general  consideration  is  that  invariants  that  are  more 
general,  i.e.  that  stay  invariant  under  a  wider  class  of  trans¬ 
formation,  are  fewer  than  “weaker”  invariants.  For  instance, 
the  length  of  the  rod,  a  rotational  invariant,  is  not  an  invari¬ 
ant  under  the  more  general  projective  transformation.  If  vision 
were  concerned  only  with  Euclidean  transformations,  it  would 
make  little  sense  to  generalize  it  to  projective  transformations. 
However,  many  vision  problems  are  concerned  with  perspective 
transformation,  such  as  in  the  case  of  a  picture  taken  of  a  planar 
object, when the planes of the object and the film are not parallel.
Projective transformations form the smallest group that contains
both perspective and Euclidean transformations as subsets,
hence their importance. Apart from the invariants issue,
using projective geometry can unite and simplify the treatment
of perspective and orthographic projections, which are usually
treated separately [e.g. Horn, 1986].
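The remark about the rod can be illustrated numerically. The following sketch is only an illustration: the helper names, the rotation angle, and the matrix H are arbitrary assumptions. It measures the length of a rod after a rotation and after a projective transformation of the plane.

```python
import math

def rotate(p, theta):
    """Rotate the 2-D point p about the origin by theta radians."""
    x, y = p
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def projective(p, H):
    """Apply a 3x3 projective transformation H to the 2-D point p."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def length(p, q):
    return math.hypot(q[0] - p[0], q[1] - p[1])


rod = ((0.0, 0.0), (2.0, 0.0))
H = ((1.0, 0.0, 0.0),
     (0.0, 1.0, 0.0),
     (0.1, 0.0, 1.0))  # illustrative values only

rotated = tuple(rotate(p, 0.7) for p in rod)
projected = tuple(projective(p, H) for p in rod)
print(length(*rod), length(*rotated), length(*projected))
```

The rotated rod keeps its length, while the projectively transformed rod does not, which is why a quantity that is invariant under the smaller group need not be invariant under the larger one.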

The case mentioned above of imaging a planar object is often
applicable to 3-D objects, since many objects contain planar
shapes, such as facets, symmetry planes etc., which are generally
projected onto the image perspectively. Thus, 2-D projective
transformations and their invariants have to be called on for
recognition of many 3-D objects. For planar shapes, one needs
only one point of view to measure their projective invariants. The
2-D case will be treated in Section 3.

In the full 3-D case, one has to separate two different issues:
“object recognition” and “3-D shape reconstruction”. In the
former, we deal purely with 3-D. We assume that a visible 3-D
shape has already been reconstructed from images and any
other clues, and we want to match it to a database of known 3-D
objects. The stored shape is likely to differ from the observed one
by rotation, translation and dilation transformations, so we need
invariants to those, and the demand for projective invariants is
stronger than we really need in this case. There exist, however,
such invariants in 3-D and they are treated in Section 7.

In most applications, however, it is hard to isolate the pure
3-D recognition problem since it is coupled with the problem of
reconstruction from 2-D images. In this case, projective geometry
is very helpful in taking advantage of (3-D) contours that
might be visible on the object. We can show that in general
a 3-D curve, such as a contour, can be reconstructed from two
2-D images, without any need for matching or registration of
individual points on the curve. By the same method we can
recognize the curve and distinguish it from any other curve (up
to a projective transformation). (We exclude some degenerate
cases.) Other methods do rely on point-by-point matching,
which is a difficult and largely unsolved problem. 3-D curves
are treated in Sections 3, 4, 5.

Two main kinds of projective invariant theories have traditionally
been studied: differential and algebraic invariants.
Halphen [1882], Wilczynski [1906, 1907, 1908], Lane [1932, 1941]
and Fubini [1927] dealt with differential invariants. These are
quantities that are based on local properties of a shape, such as
derivatives. For vision, these are good starting points, but several
problems arise in applying their theories. First, one has to
assume that the functions describing the shape (curve, surface)
are sufficiently differentiable, and sometimes up to fourth order
derivatives are needed. In practice, however, such derivatives
cannot be obtained with satisfactory reliability. Second, as the
invariants are local, one has to know the local correspondence
between points of the images obtained from different viewpoints.
Point correspondence is a very difficult and generally unsolved
problem in vision. Thus, the classical theory of differential
invariants, while providing a good basis for further development,
cannot, as it stands, be applied directly. We study several ways
to solve the above problems. Among them: (1) The shape can
be expressed as a superposition of some basis functions. The
usual methods such as Fourier components or splines are only
partially useful, and it is preferable to use basis functions which
are themselves invariant. A related method is smoothing or
blurring by some filter such as a Gaussian, and again we prefer an
invariant filter. As the basis functions are analytic, this
superposition can be differentiated reliably. (2) The correspondence and
matching problems will be solved by going over to an “invariant
space”. (3) In the integral features approach (e.g. Amari [1974])
one integrates over the local invariants, and obtains global, or
integral, invariants, such as moments. The problem is to find
suitable quantities to integrate on. Some Euclidean moment
invariants have been discussed by [Reddi, 1981]. Those methods
are described in Section 5.

The other classical theory is that of algebraic invariants.
(See, e.g., [Springer, 1964].) It involves shapes consisting of
homogeneous n-dimensional polynomials, such as conics, quadrics,
etc., or systems of such components. Under projectivity, some
properties of such a shape are preserved. However, most shapes
of interest are neither polynomials, nor systems of polynomials,
so this theory is not directly applicable to vision. To make use
of the theory, we employ a new decomposition of shapes into
polynomial segments, reported elsewhere [Weiss and Rosenfeld,
1986]. This new decomposition, apart from enabling us to apply
the theory of algebraic invariants to general shapes, has numerous
other advantages, such as being itself projectively invariant
(i.e. the positions of the break points on the curve do not
change). It can also be obtained very rapidly. These properties
also make it advantageous to use our new decomposition,
instead of splines, in the case of differential invariants discussed
above.

Unlike the differential invariants, which change from one
point to another, the algebraic ones are constant over whole
segments of the shape, so we shall call them “regional” to
distinguish them from the local (differential) ones. The method
can also be integrated to yield global invariants.

Apart from the geometrical invariants discussed so far, which
are intrinsic to the shape, one can make use of extrinsic
invariants. These are functions defined on points of the shape,
and are obtained from measurements having nothing to do with
the geometry, thus being independent of the coordinate system.
Examples are the reflectance of a surface, or its color. Shading is
also independent of the point of view, provided the reflectance
is Lambertian and the lighting remains constant with respect
to the object. These quantities are usually local (and thus less
sensitive to noise). In trying to make them more global, the
geometry of the problem enters. We can apply the method
developed for the geometric invariants to make the extrinsic
invariants more global. For instance, we can define moments of
these quantities in an invariant way. In some cases, such as planar
texture, this may be the only reasonable invariant treatment, as
the geometry of a plane is not very informative.

Of some use is an additional kind of invariant, the well
known cross ratio of four points. The invariance of this ratio
has  led  to  the  use  of  invariant  coordinates,  in  which  the  position 
of  a  point  is  measured  by  the  cross  ratio  that  this  point  makes 
with  three  known  reference  points.  An  account  of  the  use  of 
this  method  in  vision  can  be  found  in  [Duda  and  Hart,  1973]. 
However,  the  problem  of  finding  correspondences  between  refer¬ 
ence  points  in  different  views  again  prevents  the  wide  use  of  this 
method,  although  in  some  cases  statistical  methods  can  be  used 
[Chang,  Davis,  Dunn,  Eklundh  and  Rosenfeld,  1986]. 
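The invariance of the cross ratio can be checked on a small example. The following Python sketch is not taken from the paper; the function names, the four sample points and the 2x2 matrix are assumptions chosen for illustration, and the cross ratio is written in one of its several common conventions. It computes the cross ratio of four collinear points before and after a 1-D projective map, using exact rational arithmetic.

```python
from fractions import Fraction


def cross_ratio(a, b, c, d):
    """Cross ratio (a, b; c, d) of four distinct points on a line."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))


def projectivity_1d(m, x):
    """Apply the 1-D projective map x -> (m00*x + m01) / (m10*x + m11)."""
    (m00, m01), (m10, m11) = m
    return (m00 * x + m01) / (m10 * x + m11)


# Hypothetical sample data: four points on a line and a non-singular map.
points = [Fraction(0), Fraction(1), Fraction(3), Fraction(7)]
m = ((Fraction(2), Fraction(1)), (Fraction(1), Fraction(5)))  # determinant 9

images = [projectivity_1d(m, x) for x in points]

before = cross_ratio(*points)
after = cross_ratio(*images)
print(before, after)
```

Fixing three of the points as references and letting the fourth vary gives the "invariant coordinate" of the fourth point mentioned above.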

The  next  section  will  briefly  review  the  basic  projective  ge¬ 
ometric  tools  needed  later.  The  following  sections  will  deal  with 
plane  and  space  curves,  developing  the  various  invariants  de¬ 
scribed  above.  The  last  sections  will  deal  with  surfaces. 

2.  GEOMETRICAL  PRELIMINARIES 
2.1  Homogeneous  Coordinates 

Projective transformations, or projectivities, can be defined
geometrically in n-dimensional space, and are easiest to visualize
for 2-D as a projection by means of a pointlike light source of
one plane onto another. The corresponding algebraic expressions
are non-linear in Cartesian coordinates (homography). Going
over to the so-called homogeneous coordinates yields
perspectivities, and makes it possible to develop a rather complete
theory of projective invariants.

The homogeneous coordinates of a point (x, y, z) are defined
as follows: A point

$\mathbf{x} = (x_1, x_2, x_3, x_4)$

in 4-D space corresponds to (x, y, z) in 3-D space if

$x = x_1/x_4, \quad y = x_2/x_4, \quad z = x_3/x_4.$

(Of course this definition is not unique, as the $x_i$ can be
multiplied by a common factor and still correspond to the same
(x, y, z). In fact, a whole 4-D line corresponds to a 3-D point.
A useful invariant should be unaffected by this multiplication,
as well as being unchanged by transformations in the original
3-D space.) For a point with $x_4 = 0$, the division is not possible
and such a point does not correspond to a point in Euclidean
space. Intuitively, it is a point at infinity. The point (0, 0, 0, 0) is
excluded from the space. For projective geometry in the plane,
$x_3$ plays the role of $x_4$ above.

It  can  easily  be  shown  that  a  general  projective  transfor¬ 
mation  corresponds  to  a  general  linear  transformation  in  the 
homogeneous  coordinate  space,  i.e.  to  every  projectivity  in  n-D 
space  corresponds  a  linear  matrix  in  (n  +  1)-D  space,  defined  up 
to  a  multiplicative  factor. 
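A minimal Python sketch of these statements for the plane case follows. It is an illustration only: the helper names and the 3x3 matrix `T` are assumptions, not something given in the paper. It shows that a projectivity acts as a matrix product on homogeneous coordinates, and that rescaling either the coordinates or the matrix by a common factor leaves the resulting Euclidean point unchanged.

```python
from fractions import Fraction


def to_homogeneous(p):
    """Append a unit last coordinate: (x, y) -> (x, y, 1)."""
    return tuple(p) + (Fraction(1),)


def from_homogeneous(h):
    """Divide by the last coordinate; undefined for points at infinity."""
    if h[-1] == 0:
        raise ValueError("point at infinity has no Euclidean counterpart")
    return tuple(c / h[-1] for c in h[:-1])


def apply(matrix, h):
    """Multiply a square matrix (tuple of rows) by a homogeneous vector."""
    return tuple(sum(m * c for m, c in zip(row, h)) for row in matrix)


# Hypothetical non-singular projectivity of the plane (determinant 1).
T = ((2, 0, 1),
     (0, 1, 3),
     (1, 1, 4))

p = (Fraction(1), Fraction(2))
h = to_homogeneous(p)
image = from_homogeneous(apply(T, h))

# Rescale both the coordinates and the matrix by arbitrary non-zero factors.
scaled_h = tuple(5 * c for c in h)
scaled_T = tuple(tuple(-3 * m for m in row) for row in T)
image_scaled = from_homogeneous(apply(scaled_T, scaled_h))
print(image, image_scaled)
```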

2.2  Invariants  and  Covariants 

Let  us  define  these  concepts  involving  invariance  more  ac¬ 
curately,  for  general  transformations  (not  only  projectivities). 
The  geometric  shape  itself  is  a  fixed  entity  in  space,  but  its 
algebraic  representation  necessitates  choosing  some  coordinates 
and  parameters,  and  it  is  their  transformation  which  raises  the 
invariance issue. A shape can be represented by a relation

$\Phi(a_k, x_i) = 0$  (2.1)

The $a_k$ may include parameters which are defined on the shape,
or other coefficients. Examples are a parameter t along a curve,
or the coefficients of a conic. The $x_i$ are the space coordinates
of a point on the shape. The function $\Phi$ defines the shape by
relating the shape parameters $a_k$ to the space coordinates. The
theories of invariance begin by finding a function $\Phi$ that
preserves its form under the transformation of either $a_k$ or $x_i$ in
which we are interested. For instance, a conic, which is a second
order homogeneous polynomial (in homogeneous coordinates),
preserves this form under projectivity. In the differential
theory, $\Phi$ is a linear differential equation. Generally, when the $x_i$
are transformed to $\bar{x}_i$, one substitutes in $\Phi$ the $x_i$ as functions
of the new coordinates $\bar{x}_i$. After suitable rearrangements one
obtains a function of the same form. The same applies to a
transformation of $a_k$.

An invariant is a function of the parameters $a_k$ which, under
the transformation, changes in a way that depends only on the
transformation constants (but not on $a_k$ or $x_i$), i.e., the change,
if any, is independent of the shape. This is, in general, a relative
invariant. If there is no change at all, it is an absolute invariant.

Often, a relative invariant involves only the Jacobian of a
transformation. In this case one defines a relative invariant of
weight w as a function of the parameters, $I(a_k)$, such that

$I(\bar{a}_k) = J^w I(a_k)$

where J is the Jacobian of the transformation. When w = 0, so
that $J^w = 1$, the invariant is absolute.

For dealing with integral functionals such as moments or
features, we shall need covariants. A covariant contains the
coordinates $x_i$ as well as $a_k$. We thus define a covariant of
weight w and degree d as a function of the parameters and the
coordinates $C(a_k, x_i)$ having the properties

$C(\bar{a}_k, \bar{x}_i) = J^w C(a_k, x_i)$

$C(a_k, \lambda x_i) = \lambda^d C(a_k, x_i)$

The first relation is similar to the one for the invariants. The
second means that if all the coordinates $x_i$ are multiplied by
a common factor $\lambda(a_k)$, C is multiplied by the same factor, to
some power d. This latter transformation property is typical of
any homogeneous function of degree d. A covariant of degree
zero is an invariant. $\Phi$ itself is a covariant.

The geometric invariants for 3-D surfaces are derived from
the mathematical representation of the shape in 4-D. The
extrinsic invariants are given in 3-D, as functions $f(x, y, z)$. They
can be extended to 4-D by substituting for x, y, z their
homogeneous equivalents, $x_1/x_4, \ldots$. Thus, $f$ will be constant on the
line in 4-D corresponding to a point (x, y, z) in 3-D.

Although  in  general  vision  theory  is  interested  in  visible 
surfaces,  curves  are  also  very  important,  for  example  as  dis¬ 
continuities  on  surfaces,  or  occluding  contours.  The  invariance 
properties of contours will also help us in the stereo
reconstruction problem. In Sections 3-5 we discuss invariants of
curves and their role in vision. In Sections 7-8 we similarly discuss
invariants of surfaces.


3.  DIFFERENTIAL  INVARIANTS  OF  CURVES 
3.0  A  1-D  Differential  Invariant 

We  first  mention  a  well  known  one  dimensional  differen¬ 
tial  invariant,  namely  the  Schwarzian  derivative  [Springer  1964]. 
Consider  a  particle  moving  along  a  straight  line,  with  its  posi¬ 
tion  at  a  time  t  measured  by  a  (non-homogeneous)  coordinate 
r(t). The Schwarzian derivative S(r) is defined as

$S(r) = \frac{r'''}{r'} - \frac{3}{2} \left( \frac{r''}{r'} \right)^2$

and it is invariant under projective transformations of the line
and  under  change  in  the  parameter  t,  as  can  be  checked  directly. 
Furthermore,  the  differential  equation 

S(r)  =  g(t) 


where  g(t)  is  given,  determines  the  relation  r(t)  up  to  (1-D) 
projectivity. 
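The invariance of S(r) under projective transformations of the line can be checked exactly with truncated Taylor series. The sketch below is an illustration, not the paper's procedure; the sample function r(t) = t^3 + t, the expansion point t = 1 and the coefficients of the projective map are assumptions. It evaluates the standard Schwarzian derivative S(r) = r'''/r' - (3/2)(r''/r')^2 before and after replacing r by (a r + b)/(c r + d).

```python
from fractions import Fraction

ORDER = 4  # keep Taylor coefficients of s^0 .. s^3


def series_div(u, v):
    """Truncated power-series quotient u / v, assuming v[0] != 0."""
    q = []
    for k in range(ORDER):
        acc = u[k] - sum(q[i] * v[k - i] for i in range(k))
        q.append(acc / v[0])
    return q


def schwarzian(coeffs):
    """S(r) = r'''/r' - (3/2) (r''/r')^2 from the Taylor coefficients of r."""
    d1, d2, d3 = coeffs[1], 2 * coeffs[2], 6 * coeffs[3]
    return d3 / d1 - Fraction(3, 2) * (d2 / d1) ** 2


def moebius(coeffs, a, b, c, d):
    """Taylor coefficients of (a*r + b) / (c*r + d)."""
    num = [a * x for x in coeffs]
    num[0] += b
    den = [c * x for x in coeffs]
    den[0] += d
    return series_div(num, den)


# r(t) = t^3 + t expanded about t = 1:  r(1 + s) = 2 + 4 s + 3 s^2 + s^3
r = [Fraction(2), Fraction(4), Fraction(3), Fraction(1)]

s_before = schwarzian(r)
s_after = schwarzian(moebius(r, 3, 1, 2, 5))  # hypothetical map, ad - bc = 13
print(s_before, s_after)
```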

3.1  General  Curves  -  Introduction 

For  higher  dimensions,  the  theory  is  a  little  less  immediate. 
It  is  summarized  below.  We  work  in  homogeneous  coordinates. 

A  curve  can  be  represented  parametrically  as  a  function 

$\mathbf{x}(t) = x_i(t)$

where i = 1, 2, 3, 4 for space curves, and i = 1, 2, 3 for plane
curves. The first step is finding a projective invariant relation
that a curve satisfies (the $\Phi = 0$ in Section 2.2). Such a
relation has been found in the form of a differential equation whose
solution is the curve $x_i(t)$. At first one may question the
wisdom of replacing a simple explicit representation of the curve by
a  differential  equation.  The  advantage  can  be  seen  intuitively 
as  follows:  when  one  differentiates  a  function,  certain  constants 
are  eliminated.  These  constants  are  usually  not  invariant,  while 
some  relations  between  the  derivatives  are  intrinsic  to  the  curve 
and  are  invariant.  As  a  simple  example,  the  equation  x"  =  0 
represents  all  straight  lines,  regardless  of  their  slope  and  inter¬ 
cept,  and  only  straight  lines.  The  “straightness”  property  is 
invariant  under  projectivity,  but  the  slope  and  intercept  depend 
on  the  coordinates. 

It  can  be  shown  that  a  general  curve  in  n-D  homogeneous 
space  satisfies  the  linear  differential  equation 

$x^{(n)} + \binom{n}{1} p_1 x^{(n-1)} + \binom{n}{2} p_2 x^{(n-2)} + \cdots + p_n x = 0$  (3.1)

which is projectively invariant. $x^{(n)}$ means the n-th derivative
of x(t). (The same equation is satisfied by each component $x_i(t)$
of the vector x(t), where $p_i(t)$ are scalar functions of t.)

The  following  theorem  is  fundamental  to  the  theory  of  in¬ 
variants: 

Theorem.  The  most  general  (non-degenerate)  solution  of  equa¬ 
tion  (3.1)  is  a  curve  in  n  -  1  dimensional  space,  determined  up 
to  a  projective  transformation. 

Thus, an equation of the form (3.1) can be identified with
a class of curves that differ only by a projectivity, and studying
invariants amounts to studying the coefficients $p_i$ of this
equation.

Given the parametric representation $x_i(t)$, the coefficients
$p_i(t)$ can easily be calculated in the following way. Substitute
the n functions $x_i(t)$ one at a time in (3.1). The result is n
algebraic equations, at each point t, in the n unknowns $p_i(t)$.
Solve this linear system of equations. The solution always exists
for a proper curve (i.e. one that does not degenerate to a lower
order entity such as a point), since the determinant of this system
is non-zero. To exclude degeneracy, it is sufficient to demand that
the curve does not satisfy any first order differential equation.
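The calculation just described can be sketched for a plane curve (n = 3), where the equation reads x''' + 3 p1 x'' + 3 p2 x' + p3 x = 0. The Python code below is an illustrative sketch under stated assumptions: the function names, the sample curve (t, t^3, 1) at t = 1 and the matrix `A` are hypothetical, and the derivatives are supplied analytically rather than estimated from image data. It solves the 3x3 linear system by Cramer's rule and checks that a constant linear map of the homogeneous coordinates (a projectivity) leaves the coefficients unchanged.

```python
from fractions import Fraction


def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def plane_curve_coefficients(x, x1, x2, x3):
    """Solve x''' + 3 p1 x'' + 3 p2 x' + p3 x = 0 for (p1, p2, p3).

    x, x1, x2, x3 are 3-tuples: the homogeneous coordinates of a curve
    point and their first, second and third derivatives at one value of t.
    """
    rows = [(3 * x2[i], 3 * x1[i], x[i]) for i in range(3)]
    rhs = [-x3[i] for i in range(3)]
    d = det3(rows)
    if d == 0:
        raise ValueError("degenerate curve point: singular system")
    solution = []
    for col in range(3):  # Cramer's rule, one unknown per column
        m = [list(r) for r in rows]
        for i in range(3):
            m[i][col] = rhs[i]
        solution.append(Fraction(det3(m), d))
    return tuple(solution)


def transform(matrix, v):
    """Apply a constant 3x3 matrix to a homogeneous vector."""
    return tuple(sum(a * c for a, c in zip(row, v)) for row in matrix)


# Hypothetical curve x(t) = (t, t^3, 1) and its derivatives at t = 1.
jets = [(1, 1, 1), (1, 3, 0), (0, 6, 0), (0, 6, 0)]
p = plane_curve_coefficients(*jets)

# Hypothetical projectivity (determinant 5) applied to the curve.
A = ((1, 2, 0), (0, 1, 1), (1, 0, 3))
p_transformed = plane_curve_coefficients(*[transform(A, v) for v in jets])
print(p, p_transformed)
```

Because the last coordinate of the sample curve is constant, p3 comes out as zero, in line with the simplification mentioned in Section 4.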

Although the coefficients $p_i(t)$ are absolute invariants
under projectivities, they still depend on some arbitrary choices
that we have made in the representation. First, one has the
parameter t. This parameter is chosen arbitrarily and will not
necessarily be the same for different observers. Thus we need
functions of the $p_i$ that are invariant under transformation of
the parameter. Second, the coordinates can be multiplied by
a common factor $\lambda$. If this factor is constant throughout the
shape, there is no problem since its change is a trivial
(identity) projectivity, but when it depends on the parameters the
invariance is lost. However, some quantities made up of the $p_i$
and their derivatives are invariant and covariant with respect to
these transformations, and they will be given next.

3.2. Plane Curves

For a plane curve the projective invariant differential
equation is

$x''' + 3 p_1 x'' + 3 p_2 x' + p_3 x = 0$

(A vector equation.) One can define the quantities

$P_2 = p_2 - p_1^2 - p_1'$
$P_3 = p_3 - 3 p_1 p_2 + 2 p_1^3 - p_1''$

These quantities can be called semi-invariants. They remain
unchanged under multiplication of the coordinates by a factor
$\lambda(t)$, but not under change of the parameter t.

The invariants are

$\Theta_3 = P_3 - \frac{3}{2} P_2'$

$\Theta_8 = 6 \Theta_3 \Theta_3'' - 7 (\Theta_3')^2 - 27 P_2 \Theta_3^2$

The subscripts 3 and 8 of $\Theta$ indicate the weights of these
invariants. (Under change of the parameter t, they transform as
$\bar{\Theta}_w = (d\bar{t}/dt)^w \Theta_w$, where $\bar{t}(t)$ is the new parameter along the
curve, and w is the negative of the subscript.) $\Theta_3$ is the only
linear one. All other invariants can be derived from $\Theta_3$, $\Theta_8$ and
their derivatives. We can thus call the above two invariants a
complete set of independent invariants.

Furthermore,  the  following  theorem  will  be  of  importance: 

Theorem. The invariants $\Theta_3$, $\Theta_8$ completely determine a plane
curve up to a projective transformation.

Examples of other (dependent) invariants that might be
useful are

$\Theta_{12} = 3 \Theta_3 \Theta_8' - 8 \Theta_8 \Theta_3'$

$\Theta_{16} = \Theta_3 \Theta_{12}' - 4 \Theta_{12} \Theta_3'$

$\Theta_{21} = 2 \Theta_8 \Theta_{12}' - 3 \Theta_{12} \Theta_8'$

among which there is the nonlinear relation

$\Theta_{12}^2 + \Theta_3 \Theta_{21} - 2 \Theta_8 \Theta_{16} = 0$

We now turn to the covariants. We define

$z_1 = x' + p_1 x$
$z_2 = x'' + 2 p_1 x' + p_2 x$

As these $z_i$ are homogeneous polynomials of degree one, the
quantities $z_i/x$ are semi-covariants. The covariants are

$C_2 = z_1^2 - 2 x z_2 - P_2 x^2$

$C_4 = \Theta_3' x + 3 \Theta_3 z_1$

The subscripts of C indicate the weights of the covariants. The
degree is the degree of x in the $C_i$, which are homogeneous.
This is a complete set. All other covariants can be expressed as
functions of these and invariants.

3.3.  Space  Curves 

We  can  proceed  in  a  similar  fashion  to  deal  with  space 
curves. The differential equation is

$x^{(4)} + 4 p_1 x''' + 6 p_2 x'' + 4 p_3 x' + p_4 x = 0$

The semi-invariants are

$P_2 = p_2 - p_1' - p_1^2$

$P_3 = p_3 - p_1'' - 3 p_1 p_2 + 2 p_1^3$

$P_4 = p_4 - 4 p_1 p_3 - 3 p_2^2 + 12 p_1^2 p_2 - 6 p_1^4 - p_1'''$


The  invariants  are 

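$\Theta_3 = P_3 - \frac{3}{2} P_2'$

$\Theta_4 = P_4 - 2 P_3' + \frac{6}{5} P_2'' - \frac{6}{25} P_2^2$

$\Theta_8 = 6 \Theta_3 \Theta_3'' - 7 (\Theta_3')^2 - \frac{108}{5} P_2 \Theta_3^2$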

All  other  invariants  are  functions  of  the  above  and  their  deriva¬ 
tives.  Furthermore,  these  invariants  completely  characterize  the 
curve,  up  to  a  projective  transformation. 

For the covariants, we first define

$z_1 = x' + p_1 x$

$z_2 = x'' + 2 p_1 x' + p_2 x$

$z_3 = x''' + 3 p_1 x'' + 3 p_2 x' + p_3 x$

with the semi-covariants being $z_i/x$.

The covariants are

$C_2 = 10 z_1^2 - 15 x z_2 - 12 P_2 x^2$

$C_3 = 10 z_1^3 - 3 C_2 z_1 - 9 (5 z_3 + 6 P_2 z_1 + P_3 x) x^2$

$C_4 = 2 \Theta_3 z_1 + \Theta_3' x$

which  is  also  a  complete  system.  We  turn  now  to  applying  the 
systems  of  invariants  and  covariants  for  vision  purposes. 


4.  CURVE  INVARIANTS  IN  VISION 

The  above  invariants  are  ideally  suited  for  solving  the  prob¬ 
lems  of  matching  of  shapes  stored  in  a  data  base  to  an  observed 
shape,  and  the  stereo  reconstruction  problem.  This  is  because 
they  contain  all  the  information  that  is  intrinsic  to  the  shape  it¬ 
self,  and  none  of  the  information  that  is  particular  to  the  partic¬ 
ular  coordinate  system,  namely  to  the  particular  point  of  view, 
which  stands  in  the  way  of  the  recognition  process. 

Since  high  order  derivatives  are  involved,  the  formulae  above 
cannot  be  used  directly.  In  this  section  we  study  the  method 
of  overcoming  the  problem  by  decomposing  the  functions  into 
simpler  building  blocks,  which  can  be  differentiated  analytically. 
The  advantage  of  the  method  (as  compared  to  integral  methods 
described  later)  is  in  preserving  much  of  the  local  characteristic 
of  the  shape,  making  it  possible  to  deal  with  occluded  shapes. 
It  may  be  a  little  more  complex,  however. 

As the system of invariants depends non-linearly on the
curve’s coordinates $x_i(t)$, this linear decomposition will not
simplify it. The only purpose of the decomposition is reliably
calculating the derivatives. One possible decomposition is the Fourier
transform. However, it can make the problem global again,
defeating the purpose of handling occluded shapes. Besides, the
sines and cosines are not projective invariants. Another
possible superposition is of polynomial basis functions, e.g. splines,
which are piecewise analytic. The question arises here of the
degree of the splines. As the differential equation contains a third
derivative, and the invariant functions have at least one more,
one can show that at least a quintic spline is needed (for a
continuous fourth derivative). The usual splines are not projective
invariants. However, since they are polynomials, their theory
can readily be extended to one of homogeneous polynomials in
projective coordinates. In this way they can be turned into
algebraic invariants (discussed in Section 6). Thus, the homogeneous
polynomial splines are particularly suited for representing a
visual shape invariantly.

Once the derivatives are known, they can be substituted in
the differential equation to find the coefficients $p_i(t)$. This step
is quite trivial, as we have a low-order system of linear equations.
It can be simplified even further by choosing the last coordinate,
i.e. $x_3$ in the plane or $x_4$ in 3-D space, as a constant. This will
result in $p_3$ (or $p_4$) being identically zero.

In the following step, invariants and covariants can be
calculated in a straightforward manner from the $p_i$, using the formulae
given above. All the invariants given there are relative, i.e. they
contain the Jacobian of the transformation. We are interested
now only in absolute invariants, and they can be obtained as
combinations of the relative ones.
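As a worked illustration of how such a combination arises (one possible choice, not necessarily among those the authors list), take two relative invariants of a plane curve with weights 3 and 8. Under a change of parameter each picks up a power of the Jacobian $d\bar{t}/dt$ fixed by its weight, so raising them to powers whose weights match makes the Jacobian factors cancel, wherever the denominator does not vanish:

```latex
\bar{\Theta}_3 = \left(\frac{d\bar{t}}{dt}\right)^{-3} \Theta_3, \qquad
\bar{\Theta}_8 = \left(\frac{d\bar{t}}{dt}\right)^{-8} \Theta_8
\quad\Longrightarrow\quad
\frac{\bar{\Theta}_3^{\,8}}{\bar{\Theta}_8^{\,3}}
= \left(\frac{d\bar{t}}{dt}\right)^{-24+24}
  \frac{\Theta_3^{\,8}}{\Theta_8^{\,3}}
= \frac{\Theta_3^{\,8}}{\Theta_8^{\,3}}
```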

Among  the  possible  absolute  invariants  of  plane  curves  are 


The  first  and  last  quantities  are  in  general  independent, 
as  they  derive  from  four  independent  quantities.  There  are, 
however,  important  degenerate  cases,  to  be  discussed  later.  The 
others  are  independent  too,  in  general,  but  they  need  the  use  of 
higher  derivatives.  For  a  space  curve,  enough  independent  low 
order  invariants  exist  to  satisfy  our  purposes. 

The final, and most important step is using the above
invariants in our specific vision problems. (i) In the pure recognition
problem,  one  has  a  curve  in  either  2  or  3-D,  and  the  goal  is  to 
match  it  with  a  curve  that  was  seen  from  a  different  point  of 
view,  e.g.  a  curve  stored  in  a  data  base.  The  dimensionality  of 
the  curve  and  the  space  are  unchanged.  To  do  this,  one  can  plot 
one  independent  invariant  of  the  curve  against  another.  This  will 
be  done  both  for  the  stored  curve  and  the  observed  one.  This 
plot  is  intrinsic  to  the  curve  and  remains  the  same  regardless  of 
any  projective  or  parametric  transformation  that  the  curve  may 
have  undergone.  Furthermore,  since  the  invariants  completely 
characterize  the  curve,  it  is  clear  that  if  the  plots  obtained  from 
the  observed  and  stored  curves  are  the  same,  they  belong  to  a 
similar  curve  (up  to  a  projective  transformation).  Thus  we  have 
found  a  way  of  uniquely  identifying  a  curve,  viewed  in  a  different 
coordinate  system,  without  any  search  in  a  parameter  space. 
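The comparison of two such plots can be sketched in a few lines of Python. This is only an illustration of the idea: the paper does not specify a comparison measure, so the symmetric Hausdorff distance used here is an assumption, and the sampled "invariant pairs" are synthetic stand-ins rather than invariants computed from real curves. The point of the sketch is that two samplings of the same plot under different parametrizations are close as point sets, while a different plot is not.

```python
import math


def directed_distance(a, b):
    """Largest distance from a point of a to its nearest point of b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def signature_distance(a, b):
    """Symmetric Hausdorff distance between two sampled invariant plots."""
    return max(directed_distance(a, b), directed_distance(b, a))


# Synthetic stand-ins for pairs (I1, I2) of absolute invariants.
stored = [(u / 100, (u / 100) ** 2) for u in range(101)]
# Same plot, sampled with a different (non-uniform) parametrization.
observed = [((u / 70) ** 2, (u / 70) ** 4) for u in range(71)]
# A genuinely different plot.
other = [(u / 100, (u / 100) ** 3) for u in range(101)]

same = signature_distance(stored, observed)
different = signature_distance(stored, other)
print(same, different)
```

No correspondence between individual samples is needed, which mirrors the claim that the match requires no search over the parameter of the curve.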

(ii)  In  the  reconstruction  problem,  one  has  a  space  curve 
projected  onto  two  planar  images.  One  needs  three  invariants 
to  characterize  a  space  curve,  and  these  can  be  plotted  in  a  3-D 
space.  I.e.,  to  each  space  curve  corresponds  an  “invariant  curve” 
in another 3-D “invariant” space whose coordinates are the three
invariants.  Now,  given  one  planar  projection  of  the  curve,  the 
3-D  invariants  can  be  determined  only  up  to  one  parameter  (cor¬ 
responding  to  the  unknown  depth).  Therefore,  for  each  point  on 
a  projected  curve,  there  corresponds  a  whole  (curved)  line  in  the 
3-D  space  of  invariants.  The  entire  planar  projection  thus  gen¬ 
erates  a  surface  in  the  invariant  space,  made  up  of  the  lines 
corresponding  to  the  individual  points.  Given  two  planar  im¬ 
ages  of  a  space  curve,  one  can  generate  two  surfaces  in  the  3-D 
invariant  space.  These  surfaces  intersect  in  a  curve,  which  is  the 
invariant  curve  corresponding  to  the  original  space  curve.  (The 
intersection always exists if the geometric curve exists.) This
invariant curve is sufficient to identify the original curve, similarly
to  the  plot  in  (i).  We  have  thus  been  able  to  solve  the  stereo 
reconstruction  problem,  along  with  the  recognition  problem,  of 
a general (non-degenerate) space curve. This was accomplished
without  a  costly  search  in  a  multidimensional  parameter  space, 
and  without  facing  the  problem  of  pointwise  matching  or  regis¬ 
tration  of  curves  observed  from  different  views.  The  matching 
is  obtained  automatically  as  a  result  of  the  intersection. 


5.  INTEGRAL  INVARIANTS  OF  CURVES 

In many cases it is useful to characterize a shape by global,
integral quantities such as moments, or other “linear features”. If
one can define such quantities in an invariant way, they can help
in identifying objects whose integral quantities are known. An
integral quantity of a curve can be defined generally in ordinary
coordinates as

$M = \int I(x, y, z)\, m(x, y, z)\, dl(x, y, z)$

$I$ is some quantity measured on the surface. We will take $I$ to be
an invariant, intrinsic or extrinsic. $m$ is a predefined “measuring
function”, such as a power of x, if M is to be a moment.
$l(x, y, z)$ is an integration parameter along the curve, such as the
arclength. Similar methods apply to convolutions, in which case
the integrand will contain a filter or a kernel G(x, y, z).

The above quantities are not invariant, because neither the
measuring function $m$ nor the length element $dl$ are. For M
to be invariant, the integrand ($I\, m\, dl$) has to be so. Also, the
quantities $m$ and $l$ have to be functions of the coordinates;
otherwise the integration is meaningless. ($m$ can be constant as a
special case, but not $dl$.) Thus, we have to use the covariants
in this case, rather than the invariants, which, by definition, do
not depend on the coordinates. A convenient general way to
obtain covariant functions is to define them as functions of
covariant arguments, instead of the coordinates $x_i$ appearing in $m$, $l$.
Consider the covariant vector $C_4$, defined in Sec. 3.3. This
quantity is linear in $x_i$, so functions such as moments defined with
powers of $C_4$ are closely related to the corresponding functions
of $x_i$. Therefore we shall use $C_4$ to define a kind of “covariant
coordinates”. A minor problem arises from the fact that unlike
x, y, z, the covariant is defined in homogeneous coordinates, and
one has to make sure that multiplication of these coordinates
by a common factor $\lambda$ will not affect the result (i.e. we need
a covariant of degree zero). Our $C_4$, being homogeneous in $x_i$,
is of degree 1, i.e. multiplication of $x_i$ by $\lambda$ will result in the
multiplication of $C_4(x_i)$ by $\lambda$. We can handle this problem by
dividing the first three members of the covariant $C_4(x_i)$, namely
$C_4(x_{1,2,3})$, by the fourth, $C_4(x_4)$. Doing that, we can define
arbitrary covariant functions $m$ and $l$ as functions of the “covariant
coordinates”:

$\frac{C_4(x_1)}{C_4(x_4)}, \quad \frac{C_4(x_2)}{C_4(x_4)}, \quad \frac{C_4(x_3)}{C_4(x_4)}$

These quantities are of weight and degree zero.

The function $I$ in the integrand can be chosen as one of the
remaining invariants or covariants, or as some extrinsic invariant
$f(x, y, z)$, or even as a constant, leaving the covariant measuring
functions m(x, y, z) to characterize the curve.

The integral “features” M can now be written as

$M = \int I(x_i)\, m(x, y, z)\, dl(x, y, z)$

where all arguments are implicit functions of the parameter t.

An alternative way to deal with the non-invariance of the
(unmodified) line element $dl(x, y, z)$ is by considering densities,
$\rho(x_i)$, of invariant quantities, instead of the quantities $I(x_i)$
themselves. The product $\rho(x, y, z)\, dl(x, y, z)$ is invariant by
definition of the density (namely the average $I$ over a line element
$dl$ divided by $dl$). A density can be obtained either by
differentiating an $I$ (intrinsic or extrinsic), or by measuring the density
of some extrinsic quantity that is known to be invariant, such as
the number of black dots on a given white element of the shape.
This number is invariant, even though the shape element, and
thus the dots’ density, are not by themselves invariant.

In this approach, the integral features M can be written as

$M = \int \frac{dI}{dl}\, m(x, y, z)\, dl$

or (not equivalently)

$M = \int \sum_i \frac{dI}{dx_i}\, m(x, y, z)\, dx_i$

In a variation of the density method, the parameter t could
be used in the formulae above instead of $l$ or $x$. Also, a
combination of an invariant and a density (derivative) of another
invariant could be used, i.e.

$M = \int I_i \frac{dI_j}{da}\, m\, da$

where $a$ is any of the possible parameters $l$, $x$, $t$.
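The reason a derivative with respect to the integration parameter helps can be seen numerically: the product (dI/da) da is just the increment dI, which does not depend on which parameter a is used. The Python sketch below illustrates this with made-up functions (the names, the choice I = t^2, the measuring function m = t and the reparametrization t = s^3 are all assumptions for illustration, not quantities from the paper). It approximates M as a sum of m times increments of I, once along t and once along s.

```python
def stieltjes_feature(invariant, measure, params):
    """Midpoint approximation of M = integral of m dI along a sampled curve.

    invariant and measure are functions of the curve parameter; params is
    an increasing list of parameter values.
    """
    total = 0.0
    for a0, a1 in zip(params, params[1:]):
        mid = 0.5 * (a0 + a1)
        total += measure(mid) * (invariant(a1) - invariant(a0))
    return total


n = 2000
grid = [k / n for k in range(n + 1)]

# Parametrization by t in [0, 1]: I = t^2, m = t, so M = 2/3 exactly.
m_t = stieltjes_feature(lambda t: t ** 2, lambda t: t, grid)

# Same curve with t = s^3: I = s^6, m = s^3, s in [0, 1].
m_s = stieltjes_feature(lambda s: s ** 6, lambda s: s ** 3, grid)
print(m_t, m_s)
```

Both sums approximate the same value, although the sample points fall at different places on the curve.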

The above methods assume that the covariant $C_4(x_i)$ is not
identically zero, or constant, in all of its four components. In
some degenerate cases it does vanish so that the features M
vanish, for an arbitrary measuring function $m$. This vanishing is in
itself an invariant characteristic of the shape and can be used to
identify it. An example of this case is a straight line. Intuitively,
one can say that there are no geometrical covariants of a straight
line, other than constants, so we cannot find features, but this
fact in itself says enough about this shape.

Extrinsic invariants become particularly important when
the geometry of the shape is degenerate, e.g. it is a line or a
plane. If an extrinsic invariant $f$ is given, such as a texture
pattern, one would like to measure integral quantities of the shape,
e.g. moments. The above methods can be easily adapted to
these cases. The idea is to use the extrinsic function $f$ in a
new space ($f$, $x_i$), in which the curve is represented by both its
original geometrical coordinates, $x_i$, and the new “coordinate”
$f$. For a generic function $f(t)$ the curve will no longer be
degenerate and we can use the methods described above to derive
non-vanishing integral quantities M. Because the shape has
degenerated in dimensionality, adding the new dimension $f$ will
not complicate the treatment beyond that of a non-degenerate
shape. For example, a texture pattern on a planar shape can
be represented as a surface of grey levels in 3-D. It is not
necessary, however, to use the full 3-D invariant theory because the
$f$ coordinate stays invariant and does not transform.

In  the  texture  case  the  domain  of  integration  for  calculating 
the  moments  is  not  obvious.  Some  integral  characteristics,  how¬ 
ever,  can  be  defined  that  are  independent  of  the  domain,  such 
as  the  average  anisotropy. 

The curve can be degenerate in the ($f$, $x_i$) space also. This
will happen if the extrinsic function $f$ is constant along a shape
which is a straight line (or if one has a constant density in some
coordinate system). In this case all possible Ms as defined above
will vanish, and that will be the invariant characterization of the
shape.

6.  ALGEBRAIC  INVARIANTS 

The  theory  of  differential  and  integral  invariants  described 
above  is,  in  principle,  complete  and  should  include  the  “regional 
invariants”  described  below  as  a  special  case.  However,  this 
“special  case”  has  the  advantage  of  being  implementable  much 
more  simply  and  rapidly  than  the  general  case,  while  still  being 
applicable  to  a  wide  variety  of  curves  and  surfaces.  It  requires 
neither  finding  derivatives  nor  integrals. 

The method is based on the segmentation of a curve (or
surface) into segments of conic sections (or patches of quadric
surfaces). The following discussion is unchanged if (planar)
curves are replaced by surfaces, conic sections by quadric
patches, etc. The segmentation can be done, up to a desired
tolerance limit, very rapidly by a method described elsewhere
[Weiss and Rosenfeld, 1986]. Very briefly, this segmentation
method is based on a well known geometrical theorem, according
to which all the midpoints of parallel chords cutting a conic
section lie on a straight line. The parallel chords can be
implemented in real time as camera scan lines, and the straight
line formed by their midpoints can be detected by standard
methods such as the Hough transform, which work much more
efficiently on a straight line than on the original conic. In
short, the problem of conic detection is transformed by this
method into a problem of straight line detection.
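The midpoint theorem can be checked numerically. The following is a minimal sketch, not the method of [Weiss and Rosenfeld, 1986]: it assumes horizontal scan lines and a conic given by its six coefficients, and the function names are hypothetical.

```python
import math


def chord_midpoints(a, b, c, d, e, f, ys):
    """Midpoints of the horizontal chords of the conic
    a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0 on the scan lines y in ys."""
    mids = []
    for y in ys:
        # For a fixed y the conic is a quadratic in x: a*x^2 + B*x + C = 0.
        B = b * y + d
        C = c * y * y + e * y + f
        disc = B * B - 4 * a * C
        if a != 0 and disc > 0:  # the scan line cuts the conic in two points
            # The midpoint of the two roots is -B / (2a).
            mids.append((-B / (2 * a), y))
    return mids


def max_line_deviation(points):
    """Largest distance of the points from the line through the first and last one."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    return max(abs((x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)) / length
               for x, y in points)


# Tilted ellipse x^2 + x*y + y^2 - 1 = 0 sampled on 21 scan lines.
mids = chord_midpoints(1.0, 1.0, 1.0, 0.0, 0.0, -1.0,
                       [i / 10 for i in range(-10, 11)])
deviation = max_line_deviation(mids)
```

In a real image the midpoints would come from edge points found on each scan line, and a line detector such as the Hough transform would replace the deviation check.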

Unlike the usual segmentation method, this one is projectively
invariant. The conic sections remain conic sections after
a projective transformation. The break-points, i.e. the
intersections between the conic sections, also do not move and
correspond to the same points on the curve, when the segmentation
is done in a transformed coordinate system (within the tolerance
limit). Thus, the curve can be segmented invariantly into
a string of conic sections.

It remains to find the invariants of these conic sections.
Here we can make use of the classical theory of algebraic
invariants [Winger, 1962]. This theory is concerned with finding
the invariants of algebraic forms in homogeneous coordinates. By
algebraic form one means a homogeneous polynomial in several
variables. We are now interested in second order (quadratic)
forms, which can be written as

$$f = \sum_{i,j} a_{ij}\, x^i x^j$$

where i = 1,2,3 in the plane and i = 1,2,3,4 in 3-D. This
can be interpreted as a matrix $a_{ij}$ multiplied by the vectors
$x^i$, $x^j$. Without loss of generality, this matrix can be assumed
to be symmetric. Equating the form f to zero gives an algebraic
representation of a conic section (quadric surface). We then have

$$f = \sum_{i,j} a_{ij}\, x^i x^j = 0$$

This is a simple case of the general form of Sec. 2.2. Under
projective transformation of the coordinates $x^i$, the form of the
above equation does not change, but the coefficients $a_{ij}$ do.
The new coefficients can be easily found by expressing the old
coordinates in terms of the new ones and substituting in the form
f. However, some combinations of the coefficients are preserved,
and are thus invariants. We will now list some of the invariants.

A single quadratic form, namely each conic (quadric), has
one invariant, the determinant

$$d = |a_{ij}|$$

This invariant is of weight two, i.e.

$$\bar d = d\,J^2$$

where  J  is  the  Jacobian  of  the  projective  transformation.  The 
Jacobian  is  the  same  for  all  the  conic  sections,  so  it  is  easy  to 
eliminate,  e.g.  by  choosing  one  of  the  conics  and  dividing  the  d 
of  all  other  conics  by  the  one  of  that  particular  conic.  Thus,  from 
say  k  conics,  we  have  so  far  obtained  k  —  1  absolute  invariants 
(i.e.  of  weight  zero). 
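The weight-two behaviour and the elimination of J by division can be illustrated with exact integer arithmetic. This is a sketch under stated assumptions: the projectivity is written as a 3x3 matrix t acting on homogeneous coordinates, J is taken to be its determinant, and the two conics and all names are made up for the example.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def matmul(p, q):
    """Product of two 3x3 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transform_conic(a, t):
    """Coefficient matrix after substituting x = t x_new: a_new = t^T a t."""
    t_transposed = [list(col) for col in zip(*t)]
    return matmul(t_transposed, matmul(a, t))


circle = [[1, 0, 0], [0, 1, 0], [0, 0, -1]]      # x^2 + y^2 - 1 = 0
second = [[4, 1, 0], [1, 9, 2], [0, 2, -36]]     # an arbitrary second conic
t = [[2, 1, 0], [0, 1, 3], [1, 0, 1]]            # non-singular projectivity

J = det3(t)
d_circle, d_second = det3(circle), det3(second)
d_circle_new = det3(transform_conic(circle, t))
d_second_new = det3(transform_conic(second, t))
```

Each determinant picks up the factor J squared, so the ratio of the two determinants is the same before and after the transformation. The sketch takes the coefficient matrices as given; it does not address the overall scale of each conic equation.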

Next, one has simultaneous invariants, i.e. ones that are
made up of the coefficients of two or more forms. Two conics
[quadrics] $a_{ij} x^i x^j = 0$ and $b_{ij} x^i x^j = 0$ have the two
simultaneous invariants of weight two

$$\alpha = A^{ij} b_{ij}, \qquad \beta = B^{ij} a_{ij}$$

where $A^{ij}$, $B^{ij}$ are the cofactors of the matrices $a_{ij}$, $b_{ij}$, and
a summation over i, j is implied (by the Einstein summation
convention, i.e. when i, j appear as both sub- and superscripts).
These are in addition to the two single conic invariants, namely

$$a = |a_{ij}|, \qquad b = |b_{ij}|$$


One  can  show  that  all  other  invariants  of  two  conics  can  be 
expressed  as  polynomials  in  the  above  four. 
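A short numerical check of the four invariants of a pair of conics, again as a sketch: the cofactor contraction is computed directly, the two conics and the projectivity are arbitrary illustrative choices, and the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def cofactors(m):
    """Matrix of cofactors A^{ij} of a 3x3 matrix."""
    def minor(i, j):
        r = [k for k in range(3) if k != i]
        c = [k for k in range(3) if k != j]
        return m[r[0]][c[0]] * m[r[1]][c[1]] - m[r[0]][c[1]] * m[r[1]][c[0]]
    return [[(-1) ** (i + j) * minor(i, j) for j in range(3)] for i in range(3)]


def transform(a, t):
    """Coefficient matrix after substituting x = t x_new: t^T a t."""
    def mul(p, q):
        return [[sum(p[i][k] * q[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]
    return mul([list(col) for col in zip(*t)], mul(a, t))


def pair_invariants(a, b):
    """(|a|, A^{ij} b_ij, B^{ij} a_ij, |b|) for two symmetric 3x3 forms."""
    def contract(p, q):
        return sum(p[i][j] * q[i][j] for i in range(3) for j in range(3))
    return (det3(a), contract(cofactors(a), b), contract(cofactors(b), a), det3(b))


a = [[1, 0, 0], [0, 1, 0], [0, 0, -1]]
b = [[4, 1, 0], [1, 9, 2], [0, 2, -36]]
t = [[2, 1, 0], [0, 1, 3], [1, 0, 1]]

before = pair_invariants(a, b)
after = pair_invariants(transform(a, t), transform(b, t))
weight_two = tuple(v * det3(t) ** 2 for v in before)
```

All four quantities are multiplied by the same factor, the squared determinant of t, so any ratio of two of them is an absolute invariant of the pair.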

We can take pairs of our string of conic sections, and find the
simultaneous invariants of one pair at a time. We thus obtain
k(k - 1)/2 of these invariants. They cannot all be independent,
however, as the above number is greater than the number of
coefficients of the k conics, which is linear in k. Thus, a better
strategy would be to calculate the simultaneous invariants of
pairs of adjacent conics only.

One can go on to find simultaneous invariants of more than
two forms, but again we will encounter interdependency. The
Clebsch-Gordan theorem states that a set of algebraic forms has
a finite and complete basis of independent invariants, and any
polynomial function of these basic invariants is also an
invariant. It may be of theoretical interest to try and find all the
independent invariants of a set of conics [quadrics]. For
practical purposes, however, the above method of deriving invariants
seems sufficient.

Besides their role as invariants in their own right, the
algebraic forms are a good choice of basis functions for the
decomposition used in the differential invariant theory of the previous
section. In that case, we have several basis functions mingling at
each point, while in this section we have one basis function for
each curve segment. The decomposition of a curve by the usual
splines as basis functions is acceptable, even though it is not
invariant, because it is only an intermediate step in obtaining the
invariants. But more accurate results will probably be achieved
if the decomposition is also invariant. Thus “invariant spline”
polynomials are of interest. They differ from the ordinary
ones by being multidimensional homogeneous polynomials, but
the usual spline theory can easily be extended to them.

7.  DIFFERENTIAL  INVARIANTS  OF  SURFACES 


We turn now to the description of invariants of surfaces in 3-D
space. This section will summarize the mathematics, and the
next one will discuss vision applications. The algebraic theory,
i.e. the theory of surfaces that can be segmented into quadric
patches, is essentially similar to the case of plane curves and
has already been described in the last section. The differential
and integral theories, however, are more involved. It would be
desirable, and is theoretically possible, to derive invariants and
covariants as functions of any given parametric representation of
a surface. However, this has proven to be a very tedious task, so
surface invariants have been derived only in specialized systems
of coordinates. This does not detract much from the generality
of the invariants, because these special coordinate systems are
defined in an invariant way and can be derived from whatever
coordinate system one is initially given. One does, however,
need to invest some effort in deriving these invariant coordinates,
and perhaps the vision application will provide the impetus to
develop general expressions for the invariants that do not use
any particular coordinate system.

We first need some definitions adapted from classical
differential geometry. One has to be careful, however, to make
use only of projectively invariant terms. Thus one deals with
lines, planes, tangents, intersections, but not with curvatures,
distances or angles.

• A tangent line at a point Px on a curve C is a line passing
through Px and one neighboring point that approaches Px along
the curve C (to an infinitesimal distance).

• An osculating plane of a curve is a “higher order” tangent,
in the sense that it has more contact points with the curve. An
osculating plane through a point Px on a curve C is a plane
passing through the point Px and two neighboring points,
approaching Px along C. It can be regarded as the plane most
closely approximating the curve C in the vicinity of Px.
The osculating plane of a straight line is undetermined.

• A tangent plane to a surface S at a point Px is a plane
passing through Px and two neighboring, non-collinear points
approaching Px on the surface S.

• An asymptotic curve on a surface is one whose osculating
plane coincides with the tangent plane to the surface at
each point of the curve. In Euclidean space, this would be the
curve having the direction in which the normal curvature to the
surface vanishes, but this definition is not projectively invariant.
Asymptotes will be used in the sequel as the coordinates relative
to which the invariants are calculated. Examples of asymptotes
are the generating lines of cylinders or cones.

• A developable surface is the locus of the tangent lines to
a given curve. These special surfaces will be excluded from the
general treatment that follows as they need special treatment.

• A surface is ruled if through every point on it passes a
straight line which lies entirely on the surface.

• A net is made up of two families of curves, both covering
the whole surface.

We define a surface in 3-D space with the help of the four
dimensional projective vector $x = x_i$, i = 1,2,3,4, so that

$$x = x(u, v),$$

or equivalently

$$x_i = x_i(u, v),$$

where $x_i(u, v)$ are analytic functions of the two parameters u, v.

One can derive an equation for the asymptotic net of the
surface, based on the first and second derivatives of x with
respect to u, v, as follows. One first defines three determinants
L, M, N of these derivatives.


The equation of the asymptotes can now be written as

$$L\,du^2 + 2M\,du\,dv + N\,dv^2 = 0 \qquad (7.1)$$

This equation means that for any two infinitesimally close points
lying on the asymptotic line, the difference du in their u
coordinate can be calculated, given the difference dv in their v
coordinate, and vice versa. The equation is quadratic, with the
discriminant $NL - M^2$. We exclude the case in which this
discriminant vanishes, and thus ensure having two distinct roots of
this quadratic equation (not necessarily real). In other words,
for some increment dv between two points on an asymptote, we
have two possible increments du. Thus, we have two distinct
families of asymptotic curves, which can be used as a coordinate
net.
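As a small illustration of how equation (7.1) yields the two asymptotic directions at a point, the sketch below divides the equation by dv squared and solves for the ratio du/dv. The function name is hypothetical, and the values of L, M, N are assumed to have been computed already.

```python
import cmath


def asymptotic_directions(L, M, N):
    """Two roots r = du/dv of L*r^2 + 2*M*r + N = 0, i.e. eq. (7.1) divided by dv^2.

    The roots are returned as complex numbers because the two asymptotic
    directions need not be real.
    """
    if L == 0:
        raise ValueError("L = 0: one family is dv = 0; solve for dv/du instead")
    root = cmath.sqrt(M * M - L * N)   # vanishes exactly when N*L - M^2 = 0
    return (-M + root) / L, (-M - root) / L


real_pair = asymptotic_directions(1.0, 0.0, -1.0)      # du/dv = +1 and -1
complex_pair = asymptotic_directions(1.0, 0.0, 1.0)    # du/dv = +i and -i
```

When the discriminant vanishes the two roots coincide, which is the excluded case.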

The vanishing discriminant has a geometric meaning, as
follows:

Theorem. A surface is developable if and only if $NL - M^2 = 0$.

Thus, every non-developable surface has two families of
asymptotes forming a net.
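The theorem can be tried on two simple surfaces. The sketch below assumes the definitions customary in classical projective differential geometry, namely L = |x, x_u, x_v, x_uu|, M = |x, x_u, x_v, x_uv|, N = |x, x_u, x_v, x_vv| as 4x4 determinants of the homogeneous coordinate vector and its derivatives; these definitions, the two example surfaces and all function names are assumptions of the sketch.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (adequate for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def lmn(x, xu, xv, xuu, xuv, xvv):
    """The three determinants L, M, N at one surface point (assumed definitions)."""
    base = [x, xu, xv]
    return det(base + [xuu]), det(base + [xuv]), det(base + [xvv])


def saddle(u, v):
    """x = (u, v, u*v, 1) with its derivatives x_u, x_v, x_uu, x_uv, x_vv."""
    return ([u, v, u * v, 1], [1, 0, v, 0], [0, 1, u, 0],
            [0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0])


def parabolic_cylinder(u, v):
    """x = (u, v, u*u, 1) with its derivatives; a developable surface."""
    return ([u, v, u * u, 1], [1, 0, 2 * u, 0], [0, 1, 0, 0],
            [0, 0, 2, 0], [0, 0, 0, 0], [0, 0, 0, 0])


def discriminant(L, M, N):
    return N * L - M * M


saddle_lmn = lmn(*saddle(2, 3))
cylinder_lmn = lmn(*parabolic_cylinder(2, 3))
```

For the saddle only M is non-zero, so equation (7.1) reduces to du dv = 0 and the parameter lines themselves are the asymptotes; for the cylinder the discriminant vanishes, as the theorem requires. A common non-zero factor multiplying L, M, N would not change equation (7.1).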

We want to transform from the u, v coordinates to the
asymptotic ones. The transformation equations of the various
quantities will be useful when going over to the asymptotic
coordinate system, and will also help demonstrate some invariance
properties. We first deal with a general coordinate
transformation of (u, v) to the coordinates ($\bar u$, $\bar v$), i.e.

$$\bar u = \bar u(u,v), \quad \bar v = \bar v(u,v) \qquad (J = \bar u_u \bar v_v - \bar u_v \bar v_u \neq 0) \qquad (7.2)$$

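By standard methods it is easy to show that the derivatives
transform as

$$x_u = x_{\bar u}\,\bar u_u + x_{\bar v}\,\bar v_u,$$

$$x_v = x_{\bar u}\,\bar u_v + x_{\bar v}\,\bar v_v,$$

$$x_{uu} = x_{\bar u\bar u}\,\bar u_u^2 + 2x_{\bar u\bar v}\,\bar u_u \bar v_u + x_{\bar v\bar v}\,\bar v_u^2 + x_{\bar u}\,\bar u_{uu} + x_{\bar v}\,\bar v_{uu},$$

$$x_{uv} = x_{\bar u\bar u}\,\bar u_u \bar u_v + x_{\bar u\bar v}\,(\bar u_u \bar v_v + \bar u_v \bar v_u) + x_{\bar v\bar v}\,\bar v_u \bar v_v + x_{\bar u}\,\bar u_{uv} + x_{\bar v}\,\bar v_{uv}, \qquad (7.3)$$

$$x_{vv} = x_{\bar u\bar u}\,\bar u_v^2 + 2x_{\bar u\bar v}\,\bar u_v \bar v_v + x_{\bar v\bar v}\,\bar v_v^2 + x_{\bar u}\,\bar u_{vv} + x_{\bar v}\,\bar v_{vv},$$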

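and the quantities M, N, L transform as

$$\bar L = L\,u_{\bar u}^2 + 2M\,u_{\bar u} v_{\bar u} + N\,v_{\bar u}^2,$$

$$\bar M = L\,u_{\bar u} u_{\bar v} + M\,(u_{\bar u} v_{\bar v} + u_{\bar v} v_{\bar u}) + N\,v_{\bar u} v_{\bar v},$$

$$\bar N = L\,u_{\bar v}^2 + 2M\,u_{\bar v} v_{\bar v} + N\,v_{\bar v}^2.$$

From the last equations it is easy to show that

$$\bar L \bar N - \bar M^2 = \frac{1}{J^2}\,(LN - M^2).$$

Thus, an equation $LN - M^2 = 0$, if it holds, is invariant, which is
not surprising since the property of being developable is
invariant. The asymptotes are also invariants because their equation
(7.1) is.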

We can now choose $\bar u$, $\bar v$ as the asymptotes. It is of
interest to find an explicit equation for these $\bar u$, $\bar v$, in contrast
to eq. (7.1) which gave the asymptotes as increments in the old
coordinates du, dv. The fact that the new coordinates coincide with
the asymptotes makes $\bar u$ constant along one asymptote and $\bar v$
constant along the other. In terms of u, v this yields

$$d\bar u = \bar u_u\,du + \bar u_v\,dv = 0$$
$$d\bar v = \bar v_u\,du + \bar v_v\,dv = 0$$

This relation means that the ratio of the derivatives $\bar u_u/\bar u_v$ is the
negative inverse of the ratio of increments du/dv. Substituting this
in equation (7.1) for the asymptotes we find a differential equation
for $\bar u$, $\bar v$:

$$N\,w_u^2 - 2M\,w_u w_v + L\,w_v^2 = 0 \qquad (7.5)$$

where w can be either $\bar u$ or $\bar v$. Again, this is a quadratic equation.
So, the asymptotic coordinates $\bar u$, $\bar v$ on the surface can be found
either by solving equation (7.5) or by using equation (7.1) for
the increments of u, v along the asymptotes, which may be easier
in practice.

From  now  on  we  shall  work  with  asymptotic  coordinates 
only,  so  we  can  drop  the  bars  and  denote  them  by  u,  v. 

We  are  now  in  a  position  to  develop  the  theory  of  invariants 
along  lines  similar  to  those  used  for  curves.  First,  one  has  to  find 
the  differential  equation  satisfied  by  the  surface.  The  following 
theorem  holds: 

Theorem. Every non-developable surface, parametrized by its
net of asymptotes, satisfies a set of two equations of the form

$$x_{uu} + 2a\,x_u + 2b\,x_v + c\,x = 0,$$
$$x_{vv} + 2a'\,x_u + 2b'\,x_v + c'\,x = 0, \qquad (7.6)$$

whose coefficients are scalar functions of u, v. A non-degenerate
surface will not satisfy any equation of the form

$$A\,x_{uv} + B\,x_u + C\,x_v + D\,x = 0 \qquad (7.7)$$

whose coefficients are not all zero.

Conversely: the solution of such a system of equations is
a surface that is determined up to a projective transformation.
It is non-developable, does not degenerate to a curve, and the
parameters u, v are along its asymptotes.

Thus, the above set of equations, with some particular
coefficients, identifies one particular class of surfaces, differing only
by projectivity. Studying the projective invariant properties of
a surface amounts to studying its set of equations as written
above.


For completeness, we mention that the coefficients in the
above equations cannot be completely arbitrary, but are subject
to the integrability conditions

$$a_v - b'_u = 0,$$

$$a'_{uu} + c'_u - 2a'a_u - 2a\,a'_u - (a_{vv} + 2b'a_v - 2b\,a'_v - 4a'b_v) = 0,$$

$$b_{vv} + c_v - 2b\,b'_v - 2b'b_v - (b'_{uu} + 2a\,b'_u - 2a'b_u - 4b\,a'_u) = 0,$$

$$c'_{uu} - 4c\,a'_u - 2a'c_u + 2a\,c'_u - (c_{vv} - 4c'b_v - 2b\,c'_v + 2b'c_v) = 0.$$

As in the case of the curve, the projective invariance of
the differential equation is not enough for our purposes. We
need invariance to transformation of the parameters along the
asymptotes, and invariance to multiplication of the components
$x_i$ by a common factor $\lambda(u, v)$.

A change in the parametrization of the asymptotes, without
changing the net itself (which is the only change we will dare to
make) can be represented as

$$\bar u = \alpha(u), \qquad \bar v = \beta(v) \qquad (7.8)$$

where $\alpha(u)$ depends on u only, and $\beta(v)$ depends on v only. We
can then define the “semi-invariants”

$$f = c - a_u - a^2 - 2bb', \qquad g = c' - b'_v - b'^2 - 2aa'$$

and “semi-covariants”

$$z = x_u + a\,x,$$

$$\rho = x_v + b'\,x,$$

$$\sigma = x_{uv} + b'\,x_u + a\,x_v + \tfrac{1}{2}(a_v + b'_u + 2ab')\,x$$

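With these notations, we can now write the (relative)
invariants as

$$h = b^2 (f + b_v) - \tfrac{1}{4}\, b\, b_{uu} + \tfrac{5}{16}\, b_u^2,$$

$$k = a'^2 (g + a'_u) - \tfrac{1}{4}\, a'\, a'_{vv} + \tfrac{5}{16}\, a'^2_v$$

Upon the above transformation of coordinates, these transform as

$$\bar h = \frac{\beta_v^2\, h}{\alpha_u^6}, \qquad \bar k = \frac{\alpha_u^2\, k}{\beta_v^6}$$
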

All  other  invariants  can  be  derived  from  those  above  and 
their  derivatives.  Absolute  invariants  are  easily  obtained  from 
the  relative  ones. 
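As a worked illustration of how absolute invariants follow from relative ones, suppose (this is an assumption of the sketch, obtained by applying the parameter change (7.8) to the surface equations, and not a formula quoted from the text) that the four basic invariants scale as shown in the first line below. Products of powers whose exponents of $\alpha_u$ and $\beta_v$ cancel are then absolute; two such combinations are given in the second line.

```latex
\bar a' = \frac{\alpha_u}{\beta_v^{2}}\,a', \qquad
\bar b  = \frac{\beta_v}{\alpha_u^{2}}\,b, \qquad
\bar h  = \frac{\beta_v^{2}}{\alpha_u^{6}}\,h, \qquad
\bar k  = \frac{\alpha_u^{2}}{\beta_v^{6}}\,k

I_1 = \frac{h^{3}}{a'^{2}\,b^{10}}, \qquad
I_2 = \frac{k^{3}}{a'^{10}\,b^{2}}

% Check for I_1: h^3 scales by \beta_v^{6}/\alpha_u^{18}, and
% a'^2 b^{10} scales by (\alpha_u^{2}/\beta_v^{4})(\beta_v^{10}/\alpha_u^{20})
% = \beta_v^{6}/\alpha_u^{18}, so the ratio is unchanged.
% I_2 follows by exchanging the roles of u and v.
```

These combinations are defined only where a' and b do not vanish, i.e. away from the ruled and quadric cases.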

More  invariants  can  be  obtained  by  the  following  process. 
Define  the  invariants 

$$A = a'b^2, \qquad B = a'^2 b, \qquad H = a'h, \qquad K = bk$$


which  transform  as 


and  the  following  operators: 


a?i 


Then  the  following  functions  are  also  invariant 


The  process  can  be  repeated  iteratively. 

The  basic  covariants  are 

$$x, \qquad 6Az + A_u x, \qquad 6B\rho + B_v x, \qquad \sigma - a\rho - b'z + ab'x. \qquad (7.12)$$

All  other  covariants  can  be  obtained  from  the  above  (and  their 
derivatives)  and  invariants. 

For  most  surfaces,  the  four  invariants  a',  b,  k,  h  retain  all 
the  information  contained  in  the  surface  equations  (7.6).  More 
accurately,  one  can  state  a  “fundamental  theorem”  regarding 
surface  invariants: 

Theorem. The four invariants a', b, h, k determine a
non-developable surface up to a projective transformation, if a' and b
are not zero. If either a' or b is zero, the surface is ruled. If both
vanish, the surface is a quadric.

(A quadric is a “doubly ruled” surface. All the above
invariants vanish in this case.) There is no need to worry about
these special cases. A theory for ruled surfaces and for
developables has been developed along lines similar to the above, with
a slightly different form of the differential equations. For a
quadric, the algebraic invariants discussed earlier form the complete
theory.

We conclude this account of the classical theory of
differential invariants of surfaces by summarizing the essential steps
involved.

(1) Find the determinants L, M, N of the surface’s
derivatives.

(2) Solve the first-order equation (7.5) (or (7.1)) for the
asymptotic coordinates.

(3) Express the surface derivatives in terms of the
asymptotic coordinates u, v, substitute in the differential equations
(7.6), and solve the resulting algebraic equations for the
parameters a, ..., c'.

(4) Derive the invariants and covariants, as functions of the
parameters found in (3), using the given formulas.


8.  SURFACE  INVARIANTS  IN  VISION 

While the projective theory of space curves can be of great
value in both the 3-D reconstruction and the recognition
problems, the theory for surfaces is valuable primarily when the 3-D
surface is already known, either by earlier reconstruction from
2-D images, or from range data. Its advantage is freeing
us from the need to find visible contours on the surface.

In practical applications, the steps described above for
deriving invariants are easier described than implemented. Even
after deriving the invariants at each point, we cannot match
them point by point, as we do not know the point
correspondence. Some of the methods suggested below for handling these
difficulties are direct generalizations of those used for curves,
but some are special to 3-D.

Step (1) in the summary of the last section, namely
obtaining derivatives, is similar to that for curves. It can be done by
approximating the surface in a way which relies on information
about the surface taken from a reasonably wide region around
a point, rather than local information. Thus, instead of trying
to differentiate the surface depth function, for instance, we will
first approximate this function by splines or “invariant” spline
functions, and differentiate this approximation.

Step (2), not relevant for curves, is finding the asymptotic
lines. To do this, the differential equation (7.5) or (7.1) has to be
solved. Since it is a first order PDE, it can be solved by standard
methods, e.g. by following its characteristics [John, 1982]. These
characteristics are none other than the asymptotic lines.

We may be able to use a short cut, by using some metric
properties of the asymptotes. For instance, the normal
curvature vanishes in the direction of the asymptotes. Also, the
principal curvature directions bisect the angles between the
asymptotes. Although curvature and angles are not projective
invariants, the end result, i.e. the asymptotes, is.

Step (3) should be relatively easy. We need to find the
surface derivatives in the new, asymptotic coordinates. But we
do not need to find derivatives again, as we can use equations
(7.3) that transform the derivatives in one coordinate system to
another. Substituting these derivatives in the general surface
equations (7.6), we obtain a simple linear algebraic system of
equations for the six unknowns a, ..., c'. We have two vector
equations (since x, $x_u$ etc. are known 4-D vectors), amounting
to eight equations in all.
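Since each vector equation of (7.6) gives four scalar equations for three unknowns, the system is overdetermined and a least-squares solution is one natural choice. The sketch below is one possible way to do this, not a procedure prescribed by the text: it solves the 3x3 normal equations by Cramer's rule, and the function names and sample vectors are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    d = det3(m)
    return [det3([[rhs[i] if j == k else m[i][j] for j in range(3)]
                  for i in range(3)]) / d
            for k in range(3)]


def fit_coefficients(x, xu, xv, x2):
    """Least-squares (a, b, c) such that x2 + 2a*xu + 2b*xv + c*x = 0.

    Pass x2 = x_uu to get (a, b, c) and x2 = x_vv to get (a', b', c').
    """
    cols = [[2 * t for t in xu], [2 * t for t in xv], list(x)]
    normal = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [-sum(p * q for p, q in zip(ci, x2)) for ci in cols]
    return solve3(normal, rhs)


# Synthetic data: build x_uu from known coefficients, then recover them.
x, xu, xv = [1.0, 2.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 1.0, 2.0]
a_true, b_true, c_true = 0.5, -0.25, 2.0
xuu = [-(2 * a_true * p + 2 * b_true * q + c_true * r)
       for p, q, r in zip(xu, xv, x)]
a_fit, b_fit, c_fit = fit_coefficients(x, xu, xv, xuu)
```

With noisy derivative estimates the residual of this fit also gives a rough indication of how well the chosen coordinates follow the asymptotes.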

In step (4) we have to find the invariants using the
quantities a, ..., c' found in the previous step. This involves
differentiation of these quantities. In the curve case, this could be
done analytically, because once we found the coefficients in the
approximating functions of step (1), all subsequent steps could
be performed analytically. In the surface case, however, step
(2), finding the asymptotes, will usually stand in the way, as
it involves numerical computation. There is no general analytic
solution to a nonlinear PDE. Thus, we first have to find the
quantities a, ..., c' at many points on the surface and then find
an approximating function that can be differentiated.

The above procedure finds the invariants at any given
surface point. We now have to use them for the recognition, or
matching, problem. For that we need the absolute invariants,
which depend neither on the projectivity, nor on the
parametrization along the asymptotes, nor on a common factor
multiplying the homogeneous coordinates. Of the four basic invariants
a', b, h, k, two are absolute. We can obtain more absolute
invariants by the process indicated above, if necessary, but this
will require more differentiation. In analogy with the plotting
method in the curve case, we can now map the surface into a
k-D invariant space, each surface point being represented by k
absolute invariants. Since the surface is determined by its
invariants up to a projective transformation, the entity obtained
in the invariant space will uniquely represent the class of surfaces
differing only by such projectivity. Any other kind of difference
between surfaces will cause a difference in their mapping in the
invariant space. A matching algorithm between two projectively
different surfaces can thus consist of simply calculating the
image of the mapping of two surfaces onto the invariant space, and
matching the two images. There is no need for a search in any
parameter space. For a general projective transformation of a 3-D
shape, this could have been a 15-D search space!

The correspondence problem can now be solved as a
byproduct. Surface points that map to the same place in the invariant
space are likely to correspond to each other. The mapping is
not one-to-one, so some ambiguities will arise, but they can be
resolved by adding more points. It may be of interest to
investigate the minimum conditions for determining a unique point
correspondence.
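The matching and correspondence steps can be sketched as follows. The choice of the symmetric Hausdorff distance as the comparison between the two images in invariant space, the brute-force nearest-neighbour search, the tolerance and the sample values are all assumptions made for illustration; the text does not prescribe a particular comparison.

```python
import math


def nearest_index(p, cloud):
    """Index of the point of `cloud` closest to p (brute force)."""
    return min(range(len(cloud)), key=lambda j: math.dist(p, cloud[j]))


def hausdorff(cloud_a, cloud_b):
    """Symmetric Hausdorff distance between two finite point sets."""
    def one_sided(src, dst):
        return max(math.dist(p, dst[nearest_index(p, dst)]) for p in src)
    return max(one_sided(cloud_a, cloud_b), one_sided(cloud_b, cloud_a))


def candidate_correspondences(cloud_a, cloud_b, tol):
    """Pairs (i, j): point j of B is the nearest to point i of A, within tol."""
    pairs = []
    for i, p in enumerate(cloud_a):
        j = nearest_index(p, cloud_b)
        if math.dist(p, cloud_b[j]) <= tol:
            pairs.append((i, j))
    return pairs


# Two hypothetical surfaces, each point already mapped to two absolute invariants.
model = [(0.10, 1.50), (0.40, 0.90), (0.75, 0.20)]
scene = [(0.75, 0.20), (0.10, 1.50), (0.40, 0.90)]   # same values, other point order

distance = hausdorff(model, scene)
pairs = candidate_correspondences(model, scene, tol=1e-9)
```

No search over transformation parameters appears anywhere in the sketch; the point order of the two surfaces is irrelevant, and the candidate pairs come out as a byproduct of the comparison.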

Integral invariants can be obtained in the same fashion as
for curves. The covariant integration coordinates $\hat x, \hat y, \hat z$ are now
functions of the asymptotic coordinates, and the integral
invariant M is defined as (ds being a surface element)

$$M = \int f\; m(\hat x, \hat y, \hat z)\; ds(\hat x, \hat y, \hat z)$$

where all arguments are implicit functions of the parameters
u, v.

We have a wider choice in defining our covariant
coordinates $\hat x, \hat y, \hat z$. According to eq. (7.12), we have four sets of vector
covariants, including the coordinates $x_i$ themselves,
each of which can be defined as our integration coordinates by
dividing the first three components of the vector by the fourth.
In this way, these components become invariant to multiplying
$x_i$ by a common factor and M becomes an absolute invariant,
for every measuring function $m(\hat x, \hat y, \hat z)$. By choosing convenient
measuring functions, such as powers of $\hat x, \hat y, \hat z$, one can
characterize the surface invariantly by a small number of properties
(features). As for curves, one can generalize the treatment of
surfaces by the use of extrinsic invariants, such as shading, color,
or texture characteristics. The possibility of using densities also
extends easily to surfaces.


9.  SUMMARY 

We have surveyed theories of projective invariants and
suggested methods of applying them in vision. The vision problems
that can be solved with these theories are very general and
fundamental ones, such as 3-D reconstruction from 2-D images and
matching of images observed from different points of view.
Unlike other methods, we did not have to deal with the difficult
problems of search in parameter space and of finding point
correspondence. In subsequent papers two lines of development will
be explored: (i) implementation; (ii) finding smaller subgroups
of the projective transformation (such as Euclidean ones), which
have their own invariants, in addition to the general projective
ones.

Abstract. We study the problem of recovering a surface slice from
the shadows it casts on itself when lighted by the sun at various
times of the day. The problem is formulated and solved in a Hilbert
space setting. The spline algorithm interpolating the data that
result from the shadows is constructed. This algorithm is optimal
in terms of the approximation error and has low cost. We
implement the optimal error algorithm and show a series of test runs. In
addition, a modified version of the algorithm that improves
the cost considerably is shown. This version is suited for parallel
computation with further reductions in the cost of the solution.


1.  Introduction. 

In  computer  vision  the  surface  reconstruction  problem  is  of  crucial 
importance  in  the  process  of  recovering  the  scene  characteristics  from 
the  2-dimensional  image  created  by  the  camera. 

In the various Shape from X algorithms, different methods of
recovering a surface have been proposed. In this paper we will use a new
approach for the reconstruction of the surface shape. Namely, we will
define and solve the Shape from Shadows problem. In this problem the
shadows created by a light falling on a surface will be used to recover
the surface itself.

Very  little  work  has  been  done  using  shadows  for  the  reconstruction 
of  the  surface  shape.  The  only  work  we  are  aware  of  is  the  one  proposed 
in  [5]  where  the  problem  is  solved  using  a  relaxation  method. 

Shadows are a very strong piece of information. The process that uses
shadows is not affected by texture or by surface reflectance.
Furthermore, our imaging system does not need grey scale or color capabilities;
it is sufficient for it to be able to distinguish between black and white.
Also, noise in the form of bright spots inside a dark area and vice-versa
can be filtered out easily. From the above, it is evident that shadows
yield a powerful tool to be used in the reconstruction process.

Our problem will be the following. We have a surface which is lighted
by a light source. The light source casts shadows on the surface. Then
the light moves to a new position where it casts new shadows. We collect
the different images of the shadowed surfaces at these various times.
From those we obtain the location of the start and the end of the shadow,
plus some additional information about the surface function. Given any
series of images containing shadows, we want to obtain an algorithm
that produces an approximation to the surface with the smallest possible
error.

We choose in this approach to recover slices of the surface. A slice
is defined as the intersection of the surface and a plane of constant y
(Fig. 1). We assume that the surface slice is a function defined on the
interval [0, 1], having continuous first derivatives, and second derivatives
with bounded $L_2$ norm.

We propose an algorithm that recovers the surface and minimizes the
worst possible error within a factor of 2. The algorithm that minimizes
the error is the spline algorithm.1 We choose to construct the spline
algorithm in steps using a process that converges to the optimal error
algorithm. The justification behind this stepwise construction is that,
in general, the optimal error algorithm might need many iterations to
construct, but in most practical cases the initial version of the
algorithm, or the one resulting from a few iterations, already obtains the
optimal error. We therefore construct a process that has low cost, while
achieving the smallest possible error.

1 We will define the worst case error, the spline algorithm, and all the
other needed concepts later.


We have done a series of numerical runs to test the performance of
the above process. The obtained approximations were very close to the
function from which we obtained the data, and that can be immediately
seen from the pairs of initial function and its reconstruction that we are
supplying. We also propose a parallel implementation of the algorithm that
will considerably improve the running time.

The rest of the paper is organized as follows. In
section 2, we will formulate the problem; we will define a function space
to which our surface must belong. We will also define more precisely
the information that can be extracted from the shadows.

In section 3, we will define the optimal error algorithm. We show that
this is the spline algorithm which always exists and is unique. This will
guarantee that the shape from shadows problem under this formulation
is well-posed. The definition of a well-posed problem can be found in [3,
10]. In contrast to many other shape from X algorithms our formulation
does not require any regularization (see [8, 9] for a review of vision
problems requiring regularization).

Section  4  deals  with  the  implementation  of  the  optimal  algorithm.  Its 
performance  is  analyzed  in  terms  of  the  error  it  creates  in  recovering  a 
surface.  We  will  show  sample  runs  that  achieve  a  good  approximation 
with  a  small  number  of  data. 

In section 5 we discuss the cost of the proposed algorithm. We show
how to take advantage of the structure of the data and modify the
algorithm, so that significant cost improvements are obtained. Furthermore,
the algorithm can now be implemented in parallel, resulting in an even
further reduction in the running time.


2.  Formulation  of  the  problem. 

In the introduction of the paper we said that our aim is to recover a
2-dimensional slice of a 3-dimensional surface. A 2-dimensional slice (see
Fig. 1) can be seen as a function of one variable $f : \mathbb{R} \to \mathbb{R}$ belonging
in the space of functions $F_0$. Our aim is to obtain an approximation
$x \in F_0$ to our function $f \in F_0$ using the data that we can derive from
the shadows. We want the approximation $x$ to be as close to $f$ as
possible.


2.1  Function  Space. 


Let

$$F_0 = \{\, f \mid f : [0,1] \to \mathbb{R},\ f' \text{ absolutely continuous},\ \|f''\|_{L_2} \le 1 \,\}, \qquad (2\text{-}1)$$

be the space that contains the functions $f$ that we want to
approximate.2 3 The norm $\|\cdot\|_{L_2}$ is defined as $\|f\|_{L_2} = \sqrt{\int_0^1 |f(x)|^2\,dx}$.

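Also, define the bilinear form $\langle \cdot, \cdot \rangle$ to be such that,

$$\langle f, g \rangle = \int_0^1 f''(x)\, g''(x)\, dx, \qquad (2\text{-}2)$$

and the norm $\|\cdot\|$ to be such that,

$$\|f\| = \langle f, f \rangle^{1/2}. \qquad (2\text{-}3)$$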

Clearly $\langle \cdot, \cdot \rangle$ defined above is a semi-inner product and $\|\cdot\|$ is a
semi-norm. If we pose the additional requirements $f(0) = 0$ and $f'(0) = 0$
on our function, then $\langle \cdot, \cdot \rangle$ is an inner product and $\|\cdot\|$ a norm.
Consequently, $F_0$ equipped with $\langle \cdot, \cdot \rangle$ is a semi-Hilbert space or a Hilbert
space respectively.
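
The difference between the semi-norm and the norm can be checked numerically. The following is a minimal Python sketch, not part of the original paper: the helper name `second_derivative_norm` and the grid size are illustrative assumptions. It estimates the $L_2$ norm of the second derivative on [0, 1] by central second differences, and shows that adding a linear term $a + bx$ does not change the value, so two distinct functions can be at distance zero unless $f(0) = 0$ and $f'(0) = 0$ are imposed.

```python
import math

def second_derivative_norm(f, n=100):
    """Estimate the L2 norm of f'' on [0, 1] with central second differences.

    Hypothetical helper for illustration; n is the number of grid cells.
    """
    h = 1.0 / n
    squares = []
    for k in range(1, n):
        x = k * h
        d2 = (f(x - h) - 2.0 * f(x) + f(x + h)) / (h * h)
        squares.append(d2 * d2)
    # Mean of f''^2 over the interior nodes times the interval length (1).
    return math.sqrt(sum(squares) / len(squares))

quadratic = lambda x: x * x
shifted = lambda x: x * x + 3.0 + 5.0 * x   # same f'', different f(0) and f'(0)

norm_q = second_derivative_norm(quadratic)
norm_s = second_derivative_norm(shifted)
```

Both values approximate 2, the exact norm of the constant second derivative, even though the two functions differ everywhere on the interval.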


2.2  Information. 


In the next step we extract from the image(s) the information
that is contained in the shadows, which will be denoted by $N(f)$.4
Clearly, from the position of the light source, we can immediately obtain
the derivative of the function $f$ at the point $x_i$, the beginning
of the shadow (see Fig. 2). We can also obtain the difference
$f(\bar{x}_i) - f(x_i)$ between the function values at the end $\bar{x}_i$ and at the
beginning $x_i$ of the shadow, given by $f'(x_i)(\bar{x}_i - x_i)$. Assuming that
$l_i(x)$ is the straight line segment passing through the surface points at
$x_i$ and $\bar{x}_i$, an additional piece of information can be obtained. It holds
(see Fig. 2) that,

$$f(x) \le l_i(x), \quad \forall x \in [x_i, \bar{x}_i]. \qquad (2\text{-}4)$$


So, formally, the information $N(f)$ contains triplets,

$$\{\, f'(x_i),\ f(\bar{x}_i) - f(x_i),\ f(x) \le l_i(x) \,\}. \qquad (2\text{-}5)$$

Note that the third item in the triplet is a consistency condition.

In each one of the images in our sample there are 0, 1 or more
shadowed areas. From each one of those shadowed areas we can obtain a
triplet of the form (2-5). If we group all the data resulting from this
sampling we obtain the vector,

$$N(f) = [\, f'(x_1), \ldots, f'(x_n),\ f(\bar{x}_1) - f(x_1), \ldots, f(\bar{x}_n) - f(x_n),\ f(t_1) - l_1(t_1), \ldots, f(t_{m_1}) - l_1(t_{m_1}),\ f(t_{m_1+1}) - l_2(t_{m_1+1}), \ldots, f(t_m) - l_n(t_m) \,]^T, \qquad (2\text{-}6)$$


2 The bound of 1 on $\|f''\|_{L_2}$ is assumed without loss of generality. However, as we
already mentioned, any fixed bound is equally good.

3 The use of the interval [0, 1] is not restrictive either. Any interval [a, b] for some a
and b is equally good.

4 The concept of information is considerably different from the concept of data. The
data vector is a vector of fixed values, while information is an operator. We will use
the term somewhat imprecisely. The reader is referred to [11, 12, 13, 14] for a more
detailed discussion on the concept of information operators.


where $m_1, \ldots, m_n$ are the numbers of points $t_j$ in every interval $[x_i, \bar{x}_i]$
for which (2-4) holds, and $m_1 + \cdots + m_n = m$. Clearly, while we know
that there are exactly $2n$ pieces of information in the first part of $N(f)$,
we cannot bound the cardinality of the last part, because it can be the
case that $m \to +\infty$.
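
As an illustration of how the shadow data could be collected, here is a minimal Python sketch that works on a sampled surface profile. It is not the authors' procedure: the function names `shadow_intervals` and `shadow_data`, the convention that light rays travel in the $+x$ direction with a given slope, and the toy tent-shaped profile are all assumptions. A sample is treated as shadowed when it lies strictly below the ray grazing the highest obstacle seen so far; each shadow then yields the ray slope (which equals the derivative at the start of the shadow on a smooth surface) and the height difference between its end and its beginning.

```python
def shadow_intervals(xs, fs, slope):
    """Return index pairs (start, end) of shadows on a sampled profile.

    Assumes light rays travel in the +x direction with the given slope
    (negative when the light comes from the upper left).
    """
    intervals = []
    best = fs[0] - slope * xs[0]      # highest ray offset seen so far
    best_idx = 0
    start = None
    for j in range(1, len(xs)):
        offset = fs[j] - slope * xs[j]
        if offset < best - 1e-12:     # strictly below the grazing ray: shadowed
            if start is None:
                start = best_idx
        else:                         # lit sample: closes any open shadow
            if start is not None:
                intervals.append((start, j))
                start = None
            if offset >= best:
                best, best_idx = offset, j
    return intervals                  # a shadow running off the right edge is dropped

def shadow_data(xs, fs, slope):
    """Slope, height difference and end points for each shadow."""
    data = []
    for start, end in shadow_intervals(xs, fs, slope):
        data.append((slope, fs[end] - fs[start], (xs[start], xs[end])))
    return data

# Toy profile: a tent of height 2 followed by flat ground.
xs = [float(k) for k in range(9)]
fs = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
result = shadow_data(xs, fs, slope=-0.5)
```

For this profile a single shadow is found, running from the tent peak at x = 2 to x = 6, and the height difference equals the slope times the shadow length, as the text states.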


3.  Solution  of  the  problem  -  The  optimal  algorithm. 

We now proceed to the solution of the problem. Given information
$N(f)$, we want an algorithm $\varphi$ that will provide an approximation
to our function $f$. An algorithm is defined as any mapping from
the space of all permissible data vectors to the space $F_0$.


3.1 Algorithm error.

The error of an algorithm for a given fixed function $f$ is given by
$\|f - \varphi(N(f))\|$. We would like to know what is the largest possible
error that can be made by the algorithm, i.e. we want the error of the
algorithm for the worst possible function $f$.

Definition 3.1. The worst case error of an algorithm $\varphi$ is,

$$e(\varphi, N(f)) = \sup_{\tilde{f} \in F_0} \{\, \|\tilde{f} - \varphi(N(\tilde{f}))\|,\ N(\tilde{f}) = N(f) \,\}. \qquad (3\text{-}1)$$


Clearly, we want an algorithm that minimizes $e(\varphi, N(f))$.

Definition 3.2. An algorithm $\varphi^*$ that has the property,

$$e(\varphi^*, N(f)) = \inf_{\varphi} \{ e(\varphi, N(f)) \}, \quad \forall f \in F_0, \qquad (3\text{-}2)$$

is called a strongly optimal error algorithm.

The quantity at the right side of (3-2), i.e. the infimum of the error of
all algorithms solving the problem given information $N(f)$, is a property
of the problem itself, and does not depend on the particular algorithm
used at any moment. This quantity gives the inherent uncertainty of the
problem for given information, and is called the radius of information.
Clearly, the error of the strongly optimal algorithm equals the radius of
information.5


3.2 The spline algorithm.

We propose the spline algorithm $\varphi^s$ for the solution of our problem.
Splines have been known to give the optimal solution to many interesting
problems [1, 2, 6, 7, 11, 12].

Definition 3.3. A spline $\sigma$ is an element in the space of functions $F_0$
such that,

(1) $N(\sigma) = y$.

(2) $\|\sigma\| = \min_{f \in F_0} \{\, \|f\|,\ N(f) = y \,\}$.


The meaning of (1) is that the spline must interpolate the data, and
(2) says that the spline is the function that minimizes $\|\cdot\|$. The spline
algorithm is the process that constructs the spline.

In a Hilbert space setting one can obtain a closed form for the spline
algorithm. In our particular case, the spline algorithm is given by,

$$\varphi^s(x) = \sum_{i=1}^{2n} a_i\, g_i(x) + \sum_{j=1}^{m} c_j\, h_j(x), \qquad (3\text{-}3)$$

where $\{g_i\}_{i=1,\ldots,2n}$ and $\{h_j\}_{j=1,\ldots,m}$ are such that,

$$g_i''(x) = (x_i - x)_+^0 - (x_{i-1} - x)_+^0, \qquad (3\text{-}4)$$

where $i = 1, \ldots, n$, $(x_i - x)_+^0 = 1$ for $x_i > x$ and 0 otherwise,

$$g_{n+i}''(x) = (\bar{x}_i - x)_+ - (x_i - x)_+ - (x_i - x)_+^0\,(\bar{x}_i - x_i), \qquad (3\text{-}5)$$

where $i = 1, \ldots, n$, $(x_i - x)_+ = x_i - x$ for $x_i > x$ and 0 otherwise, and

$$h_j''(x) = (t_j - x)_+ - (x_i - x)_+ - (x_i - x)_+^0\,(t_j - x_i), \qquad (3\text{-}6)$$

where $i = 1, \ldots, n$, and $j = 1, \ldots, m$.

5 One point that must be mentioned here is that the radius of information describes,
as we said before, the inherent uncertainty of the problem and has a specific value,
say R. However small or large R may be, there is no procedure that will guarantee
error less than that.

The functions $\{g_i\}_{i=1,\ldots,2n}$ and $\{h_j\}_{j=1,\ldots,m}$ are the representers of
the functionals that construct the information $N(f)$, properly modified
to have a small area of support.

The coefficients $a_i$ and $c_j$ are chosen so that the definition of the
spline is satisfied, that is, $\varphi^s$ interpolates the data, and also minimizes
the norm $\|\cdot\|$. If the area of support of every $g_i$ is disjoint from the area
of support of every other and furthermore $\langle g_i, g_j \rangle = \delta_{ij}$, the Kronecker
delta, then the coefficients $a_i$ are given directly by the theory. This is
not the case in this setting, where the coefficients $a_i$ are obtained by
solving a system of linear equations and the coefficients $c_j$ are obtained
by directly minimizing $\|\cdot\|$. We will define the minimization problem in
section 3.3 and describe the implementation of the algorithm in section
4.

For the spline algorithm the following very strong theorem holds [7,
12].

Theorem 3.1. Let $F_0$ be a Hilbert space, $f \in F_0$ and information
$y = N(f)$. Then, the spline algorithm interpolating the data $y$ exists,
is unique, and achieves error at most twice the radius of information.

From Theorem 3.1 we can obtain two very important results. First,
our problem under the proposed formulation is well-posed. This
property [3, 10] is always desirable when solving a problem. Computer vision
problems tend to be ill-posed and considerable effort has been spent by
the vision community towards the correct formulation that will yield
well-posedness (see [4, 8, 9] for a survey).

Second, the spline algorithm has a worst case error that is within a
factor of 2 from the radius of information. The algorithm that achieves
that is called almost strongly optimal [11]. If the problem is linear,6
then the spline algorithm $\varphi^s$ has a worst case error equal to the radius
of information and is, therefore, the strongly optimal algorithm.

The shape from shadows problem is linear only if the cardinality of
the second part of the information $N(f)$ is 0, i.e. $m = 0$. If $m > 0$ then,
as derived from Theorem 3.1, $\varphi^s$ is not strongly optimal, but almost
strongly optimal.7

3.3 The minimization problem.

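We want to minimize $\|\sigma\|^2$ where $\sigma$ is given by (3-3). We can write,

$$\|\sigma\|^2 = \sum_{i=1}^{2n}\sum_{j=1}^{2n} a_i a_j \langle g_i, g_j \rangle + 2 \sum_{i=1}^{2n}\sum_{j=1}^{m} a_i c_j \langle g_i, h_j \rangle + \sum_{j_1=1}^{m}\sum_{j_2=1}^{m} c_{j_1} c_{j_2} \langle h_{j_1}, h_{j_2} \rangle = a^T G a + 2\, a^T P c + c^T H c, \qquad (3\text{-}7)$$

where $G = \{\langle g_i, g_j \rangle\}_{i,j=1,\ldots,2n}$, $P = \{\langle g_i, h_j \rangle\}_{i=1,\ldots,2n;\ j=1,\ldots,m}$, and
$H = \{\langle h_i, h_j \rangle\}_{i,j=1,\ldots,m}$.

Also,

$$\langle \sigma, g_i \rangle = \sum_{l=1}^{2n} a_l \langle g_l, g_i \rangle + \sum_{j=1}^{m} c_j \langle h_j, g_i \rangle = y_i \ \Longrightarrow\ G a + P c = y \ \Longrightarrow\ a = G^{-1}(y - P c), \qquad (3\text{-}8)$$

and

$$\langle \sigma, h_j \rangle = \sum_{i=1}^{2n} a_i \langle g_i, h_j \rangle + \sum_{l=1}^{m} c_l \langle h_l, h_j \rangle \le \lambda_j \ \Longrightarrow\ P^T a + H c \le \lambda \ \Longrightarrow\ P^T G^{-1} y - P^T G^{-1} P c + H c \le \lambda \ \Longrightarrow\ (H - P^T G^{-1} P)\, c \le \lambda - P^T G^{-1} y. \qquad (3\text{-}9)$$

6 For the definition of a linear problem see [7, 12, 13, 14].

7 Optimal algorithms for non-linear problems are not known in general, and
they can be very difficult and expensive to calculate.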


Now, if we substitute (3-8) for $a$ in (3-7) we obtain,

$$\|\sigma\|^2 = c^T H c + 2\,\bigl(G^{-1}(y - P c)\bigr)^T P c + \bigl(G^{-1}(y - P c)\bigr)^T G\, \bigl(G^{-1}(y - P c)\bigr) = c^T (H - P^T G^{-1} P)\, c + y^T G^{-1} y. \qquad (3\text{-}10)$$

Since $y^T G^{-1} y$ has a known fixed value for a given problem, we have
to minimize $c^T (H - P^T G^{-1} P)\, c$ given the conditions in (3-9). This is
a quadratic minimization problem that can be solved using a standard
method.
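
The reduction of the objective to a quadratic form in $c$ alone can be sanity-checked numerically. The Python sketch below uses made-up values for a tiny case with two linear data and one non-linear sample; the matrices, the helper names `solve2` and `dot`, and the chosen value of $c$ are illustrative assumptions, not data from the paper. It evaluates the full expression with $a = G^{-1}(y - Pc)$ and the reduced expression, which should agree whenever $G$ is symmetric and invertible.

```python
def solve2(G, b):
    """Solve a 2x2 linear system G z = b by Cramer's rule."""
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return [(b[0] * G[1][1] - G[0][1] * b[1]) / det,
            (G[0][0] * b[1] - b[0] * G[1][0]) / det]

def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))

# Made-up data: G is symmetric positive definite, P has a single column.
G = [[2.0, 1.0], [1.0, 2.0]]
P = [1.0, 0.5]
H = 3.0
y = [1.0, 2.0]
c = 0.7

# Full objective: a^T G a + 2 a^T P c + c^T H c with a = G^{-1}(y - P c).
a = solve2(G, [y[0] - P[0] * c, y[1] - P[1] * c])
Ga = [dot(G[0], a), dot(G[1], a)]
full = dot(a, Ga) + 2.0 * dot(a, P) * c + H * c * c

# Reduced objective: c^T (H - P^T G^{-1} P) c + y^T G^{-1} y.
reduced = (H - dot(P, solve2(G, P))) * c * c + dot(y, solve2(G, y))
```

The two numbers coincide up to rounding, which confirms that only the term quadratic in $c$ has to be minimized once the data $y$ are fixed.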

The  cardinality  of  the  non-linear  part  of  the  information  is  not  fixed 
and  the  choice  of  the  value  of  m  must  be  performed  carefully,  especially 
since  the  minimization  process  is  costly.  One  may  choose  to  always 
ignore  the  non-linear  information  encoded  in  (2-4).  On  the  other  hand 
non-linear  information  of  some  fixed  cardinality  may  be  always  included, 
regardless  of  whether  the  constraints  in  (2-4)  are  violated  or  not. 

We choose an intermediate approach which will always assure that
(2-4) holds, and at the same time will minimize the cost of the algorithm
(see sections 4 and 5).

4.  Application  of  the  algorithm  -  Numerical  runs. 

The  spline  algorithm  of  section  3  has  been  applied  and  its  performance 
has  been  tested  in  practice. 

4.1  Algorithm  implementation. 

In our implementation the calculation of the spline algorithm
proceeds in steps. From our early experience with experimental systems
we have concluded that, except in very few cases, the non-linear part of
the information is not needed. This means that the approximation
produced by $\varphi^s(x)$ using information $N(f)$ with $m = 0$ does not violate
the constraints (2-4).

Stage  1: 

Therefore, we begin the implementation of the spline algorithm by
assuming that $m = 0$, and we first construct the values of the
coefficients $a_i$. This is done by solving the system of equations,

$$G a = y, \qquad (4\text{-}1)$$

where $G = \{\langle g_i, g_j \rangle\}_{i,j=1}^{2n}$ and $\{g_i\}_{i=1,\ldots,2n}$ are given by (3-4) and
(3-5). The system is solved by a direct method without the need for
pivoting since it is symmetric, positive definite and has a nice structure
that reduces the number of calculations.
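
A direct method that needs no pivoting for a symmetric positive definite system is the Cholesky factorization. The paper does not say which direct method was used, so the following Python sketch is only one plausible choice; the function name `cholesky_solve` and the 2-by-2 example are assumptions, and the sketch does not exploit the special structure of $G$ mentioned above.

```python
import math

def cholesky_solve(G, y):
    """Solve G a = y for symmetric positive definite G (no pivoting)."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    # Factor G = L L^T, one row of the lower triangle at a time.
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = G[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (G[i][j] - s) / L[j][j]
    # Forward substitution: L z = y.
    z = [0.0] * n
    for i in range(n):
        z[i] = (y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T a = z.
    a = [0.0] * n
    for i in reversed(range(n)):
        a[i] = (z[i] - sum(L[k][i] * a[k] for k in range(i + 1, n))) / L[i][i]
    return a

coefficients = cholesky_solve([[4.0, 2.0], [2.0, 3.0]], [2.0, 5.0])
```

For the small system above the solution is $a = (-0.5,\ 2)$, which can be verified by substituting back into both equations.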

As a next step, we use the computed values of the $a_i$'s to construct
the spline algorithm, and we plot its graph.

Third, we check to see whether the non-linear constraints are violated.
This is done on line while we are plotting $\varphi^s(x)$.

If the non-linear constraints (2-4) are not violated, which as we
mentioned before is usually the case, we do not need to do anything else. We
have already obtained the approximation $\varphi^s(x)$ to the function $f$ and we
have plotted it. Also, since we have not used the non-linear part of the
information, the problem is linear, hence the spline algorithm achieves
the radius of information.

Stage  2: 

If  the  constraints  (2-4)  are  violated,  then  we  do  not  have  a  sufficiently 
good  approximation,  which  means  that  we  must  obtain  the  coefficients 
Cj  of  (3-3).  To  do  so,  we  have  to  solve  the  minimization  problem  derived 
in  section  3.3. 

We  will  consequently  proceed  as  follows.  We  will  take  a  few  points 
in  the  shadowed  intervals  where  the  constraints  are  violated.  For  these 
points  we  solve  the  minimization  problem.  Then  we  check  again  for 
violations  of  the  non-linear  constraints.  If  there  are  violations  we  repeat 
Stage  2.  We  select  a  few  more  points  from  the  interval(s)  where  (2-4)  is 
violated,  and  we  add  them  to  the  sample.  The  minimization  is  repeated 
for the new set of points and the new coefficients are derived. At the
same time, the $a_i$'s and the old $c_j$'s are modified.

We perform the minimization for a few points at a time for various
reasons:

(1) It is a costly process and we would like to keep the dimensions
of the problem as small as possible.

(2) At each new iteration we do not need to undo our previous work,
but we simply modify the existing coefficients while deriving the
new ones.

(3) We rarely need to use more than one or two points per shadowed
area.
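
The two-stage procedure described above can be summarized as a small control loop. The Python sketch below is only a skeleton under stated assumptions: `solve` and `violated_points` are hypothetical callbacks standing in for the linear solve or minimization and for the check of (2-4), and, unlike the incremental update described in the text, the sketch simply re-solves with the enlarged point set at every round.

```python
def refine_until_consistent(solve, violated_points, max_rounds=20):
    """Stage 1 with no constraint points, then Stage 2 rounds as needed.

    solve(points) returns the spline coefficients for the given constraint
    points; violated_points(coeffs, points) returns a few new points where
    the constraints still fail, or an empty list. Both are hypothetical.
    """
    points = []
    coeffs = solve(points)                 # Stage 1: linear information only
    for _ in range(max_rounds):
        new_points = violated_points(coeffs, points)
        if not new_points:                 # constraints hold: done
            return coeffs, points
        points.extend(new_points)          # add a few points to the sample
        coeffs = solve(points)             # Stage 2: minimize again
    raise RuntimeError("constraints still violated after max_rounds")

# Toy stand-ins: the 'solution' is just the number of constraint points,
# and the constraints are declared satisfied once two points are in use.
toy_solve = lambda points: len(points)
toy_check = lambda coeffs, points: [len(points)] if len(points) < 2 else []
toy_coeffs, toy_points = refine_until_consistent(toy_solve, toy_check)
```

Adding only the points reported by the check keeps the minimization small, in line with reasons (1) and (3) above.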

4.2  Test  runs. 

We  have  constructed  a  broad  series  of  functions  and  we  have  run  our 
algorithm  on  them.  We  started  from  the  smoothest  possible  function 
which  is  a  trigonometric  one.  Fig.  3  shows  the  graph  of  this  function, 
and  the  graph  of  the  approximating  spline.  A  broken  line  is  used  to 
draw  the  function  and  a  solid  one  is  used  to  draw  the  approximating 
spline. 


Figure  3. 

The  information  we  have  used  has  been  obtained  for  only  2  different 
light  positions,  and  already  yields  a  very  close  approximation.  If  we  add 
samples  from  another  2  light  positions  for  a  total  of  4,  the  approximation 
is  so  close  to  the  function  /  that  the  discrepancy  cannot  be  observed. 

From  the  quality  of  this  approximation  we  may  assume  that  smooth 
functions  can  be  approximated  very  well.  We  will  move  now  to  the  other 
side  of  the  spectrum  which  contains  functions  with  as  few  derivatives 
as  possible.  This  proves  actually  to  be  the  most  difficult  case.  In  our 
setting  the  most  irregular  functions  are  the  ones  that  have  continuous 
first  derivatives  and  discontinuous  second  ones.  Piecewise  quadratic 
polynomials  with  different  second  derivative  from  piece  to  piece,  are 
functions of this type. We built many of these functions with as many
as 40 different pieces each.

Additionally, in order to magnify the visual effect of any discrepancy
between the function and the approximation we relaxed the assumption
$\|f''\| \le 1$ that we posed in our space definition. Otherwise, the difference
between $f$ and $\varphi^s$ would not be visible in any test run we would choose
to show.


In Figure 4 we can see the approximation to a function consisting of
10 piecewise polynomials of degree 2. The information used has been
obtained from 6 different lighting angles.

It can again be seen that the function $f$ and the approximating spline
$\varphi^s$ almost coincide.


We  will  now  show  one  of  the  most  difficult  functions  we  have  built, 
together  with  its  approximating  spline.  The  function  consists  of  40 
piecewise  polynomials  of  second  degree,  and  has  very  large  jumps  in 
its second derivative, hence it has large $\|f''\|$. The information we have
used to compute $\varphi^s(x)$, shown in Figure 5, is derived by using 8 different
light angles.


Another  issue  needs  to  be  discussed.  Namely,  in  all  the  above  cases 
the  approximation  to  the  given  function  has  been  constructed  without 
the  use  of  the  non-linear  information  which,  as  we  have  mentioned,  is 
usually  the  case.  We  will  contrive  a  case  where  the  use  of  the  constraints 
(2-4)  is  needed,  so  that  we  can  exhibit  the  second  stage  of  the  algorithm 
(3-3). 

When, 

(1)  f"(x)  varies  a  lot  from  one  polynomial  piece  to  the  other, 

(2)  The  sampling  is  sparse, 

(3)  Both  light  positions  are  from  the  same  side  of  the  horizon. 

Then,  the  approximation  given  by  the  first  stage  of  the  algorithm  can 
fail  to  satisfy  (2-4)  (See  Fig.  6.) 


In the example of Fig. 6 only 2 light angles were used, both from the
same side of the horizon, and far apart from each other. We therefore
used the second stage of the algorithm: we added two extra points $t_1$
and $t_2$ and we included $f(t_1) \le l_1(t_1)$, $f(t_2) \le l_1(t_2)$ in $N(f)$. Then,
the constructed approximation (Fig. 7) satisfies (2-4).

5. Cost of the algorithm - Speed improvements.

5.1  Algorithm  cost. 

Let us now discuss the speed performance of our algorithm. The
spline algorithm, as defined in section 3, is linear in terms of its input.
Thus, if we knew the coefficients $a_i$ and $c_j$, then $\mathrm{cost}(\varphi^s)$ would be $O(n)$.

In  our  case,  the  coefficients  of  the  spline  algorithm  are  not  known, 
and  must  be  constructed.  To  achieve  this  we  must  solve  a  system  of 
linear  equations,  and  sometimes,  a  minimization  problem.  These  costs 
dominate  the  cost  of  the  algorithm. 

In particular the solution of the system (4-1) has a cost $O(n^3)$. The
cost of the quadratic minimization is considerably higher: it is
exponential in the number of non-linear information samples
used in each stage. Since we have observed that better
and denser sampling alleviates the need to use the second stage of the
algorithm, we might choose to increase the cardinality of the sampling,
if that can be done, and solve a slightly larger linear problem instead.

5.2  Speed  improvements. 

In  section  5.1  we  have  discussed  the  cost  of  a  very  straightforward 
implementation  of  the  spline  algorithm  described  in  section  4.1.  We  now 
show  that  a  slight  improvement  in  the  implementation  of  the  algorithm 
can  yield  a  significant  speedup.  This  speedup  can  be  achieved  only  if 
the  function  we  want  to  recover  can  be  split  in  distinct  sections  that  we 
will  from  now  on  call  valleys.  A  valley  is  defined  by  two  local  maxima 
of  the  function,  but  also  depends  on  the  specific  sampling.  For  example, 
the function of Fig. 1 has 2 valleys. In particular, we say that the
function $f$, under some fixed sampling, has $k$ valleys if we can define
$k$ partitions $\Pi_1, \Pi_2, \ldots, \Pi_k$ of the functions $\{g_i\}_{i=1,\ldots,2n}$, given by (3-4)
and (3-5), such that the union of the areas of support of all the functions
in each partition is disjoint from the union of the areas of support of
the functions in every other partition.

We can detect the existence of any number of valleys in time $O(n)$ and
subsequently, we can solve $k$ problems of sizes $n_1, n_2, \ldots, n_k$ respectively,
instead of solving one problem of size $n$, where $n = n_1 + n_2 + \cdots + n_k$.

To  connect  the  pieces  resulting  from  each  of  the  k  problems  we  need 
constant  time  per  problem,  hence  combining  can  be  done  in  time  O(k). 

Therefore, the total cost of this algorithm will be $O(k\nu^3)$,
where $\nu = \max\{n_1, \ldots, n_k\}$.
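
Detecting valleys amounts to grouping the supports of the representers into clusters that do not overlap. The Python sketch below is one way to do it and is not taken from the paper: the function name `split_into_valleys`, the representation of each support as a `(lo, hi)` interval, and the rule that supports touching at a single point count as disjoint are assumptions. The sweep itself is linear; the initial sort is only needed if the supports are not already ordered by position.

```python
def split_into_valleys(supports):
    """Group support intervals (lo, hi) into clusters with disjoint unions.

    Returns a list of valleys, each a list of indices into `supports`.
    """
    order = sorted(range(len(supports)), key=lambda i: supports[i][0])
    valleys = []
    current = []
    reach = None                      # right end of the current cluster
    for i in order:
        lo, hi = supports[i]
        if current and lo >= reach:   # no overlap with the cluster: new valley
            valleys.append(current)
            current = []
        current.append(i)
        reach = hi if reach is None or len(current) == 1 else max(reach, hi)
    if current:
        valleys.append(current)
    return valleys

example = split_into_valleys([(0.0, 1.0), (3.0, 4.0), (0.5, 2.0), (3.5, 5.0)])
```

In the example the first and third supports overlap and form one valley, while the second and fourth form another, so two independent subproblems of size 2 can be solved instead of one of size 4.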

5.3  Parallel  implementation. 

Since splitting the problem into individual subproblems and
combining the resulting surfaces is straightforward and cheap to implement,
one immediate extension to the above set-up of the problem is to assign
each individual subproblem to a different processor and solve the initial
problem in a parallel or in a distributed environment.

Again, splitting into $k$ subproblems requires time $O(n)$ and combining
the individual solutions into one requires time $O(k)$. Then every
processor will require time $O(n_i^3)$, $i = 1, \ldots, k$, resulting in a total
cost for the parallel version $\varphi^s_p$ of our algorithm of $O(\nu^3)$, where
$\nu = \max\{n_1, \ldots, n_k\}$.

Let us now compare the three different implementations of the spline
algorithm, using a specific example. Assume we have information $N(f)$
of cardinality $n = 512$.8 Also assume that the number of valleys $k$ is 8.
We assume without loss of generality that $n_1 = n_2 = \cdots = n_k = \nu = 64$.
For these values of $n$, $\nu$, and $k$ the performance figures listed in Table
1 are obtained.

Algorithm                          Cost (millions of operations)

Straightforward (section 4.1)      512^3 ≈ 134.2
Split into valleys (section 5.2)   8 · 64^3 ≈ 2.1
Parallel (section 5.3)             64^3 ≈ 0.26
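
The operation counts for the three implementations follow directly from the cubic cost of the linear solve. A minimal Python sketch of the arithmetic is given below; the function name `cubic_costs` and the assumption that all valleys have equal size are illustrative, and constant factors are ignored.

```python
def cubic_costs(n, k):
    """Leading-order operation counts for n data split into k equal valleys."""
    nu = n // k                       # size of each subproblem
    return {
        "direct": n ** 3,             # one system of size n
        "valleys": k * nu ** 3,       # k systems of size nu, solved in turn
        "parallel": nu ** 3,          # k systems of size nu, one per processor
    }

costs = cubic_costs(512, 8)
speedup_sequential = costs["direct"] // costs["valleys"]
speedup_parallel = costs["direct"] // costs["parallel"]
```

With $n = 512$ and $k = 8$ the counts are 134,217,728, 2,097,152 and 262,144 operations, so splitting alone gains a factor of $k^2 = 64$ and the parallel version a factor of $k^3 = 512$.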


to  10  different  light  positions.  If  this  is  the  case,  the  only  occasion  when 
we can obtain N(f) of cardinality n = 512 is if k is very large, i.e. if
we  have  a  big  number  of  valleys.  In  that  case  the  speed  improvements 
should  be  even  larger  than  the  ones  already  exhibited.  Conversely,  if 
the  number  of  valleys  is  small  then  the  size  of  n  is  expected  to  be  low, 
since  it  is  proportional  to  the  number  of  light  positions,  and  will  in  no 
case  reach  the  magnitude  of  the  above  example. 

6.  Conclusion  -  Future  work. 

We  solved  the  problem  of  recovering  a  one-dimensional  surface  slice 
from  the  shadows  it  casts  on  itself  when  lighted  by  a  light  source  posi¬ 
tioned  at  various  locations. 

We  proposed  a  formulation  that  results  in  a  well-posed  problem  and 
we  have  consequently  proceeded  into  solving  it.  We  proposed  an  optimal 
error  algorithm  which  additionally  achieves  a  low  time  cost,  especially 
if  a  clever  but  simple  breakdown  of  the  problem  is  used. 

There  are  many  aspects  of  this  problem  that  can  be  looked  at  in  the 
future.  A  method  for  the  faster  solution  of  the  optimization  problem, 
based  on  its  specific  structure,  is  one  of  them.  On  the  other  hand  it 
would  be  useful  to  quantify  whether  the  use  of  the  non-linear  information 
can  be  avoided.  Our  current  belief  is  that  good  sampling  can  take  care 
of  all  cases.  The  second  stage  of  the  algorithm  might  be  still  useful  in 
cases  where  we  have  to  deal  with  a  fixed,  given,  not  very  appropriate 
sampling. 

Another  natural  extension  to  the  shape  from  shadows  problem  is  to 
try  to  recover  the  whole  3-D  surface  instead  of  recovering  surface  slices, 
as we are doing in this paper. We are currently working on this
interesting aspect.

Table 1.

It  is  apparent  from  the  above  table  that  we  can  obtain  substantial 
speed  improvements  with  very  low  added  overhead.  It  should  addition¬ 
ally  be  noted  that  we  can  obtain  a  good  approximation  using  around  8 

8 A very high number compared to the values of n used in the sample runs we have
shown.

Abstract

Given  the  physical  properties  of  a  class  of  materi¬ 
als,  it  is  possible  to  predict  the  properties  of  images  of 
these  materials.  In  this  paper,  we  examine  the  prop¬ 
erties  of  metals  and  dielectrics  which  allow  them  to  be 
distinguished  in  color  images.  We  begin  by  adopting 
general physical models for the properties of materials
which  determine  how  they  interact  with  light.  From 
these  models,  we  analyze  the  geometric,  spectral,  and 
intensity  properties  of  the  light  reflected  from  an  arbi¬ 
trary  material.  The  goal  of  this  work  is  to  isolate  those 
properties  of  the  reflected  light  which  provide  evidence 
for  the  classification  of  a  material  as  either  a  metal  or 
a  dielectric.  The  ability  to  achieve  this  level  of  classi¬ 
fication  is  a  first  step  towards  a  more  general  material 
identification  system. 

1  Introduction 

When  an  electromagnetic  wave  is  incident  on  the  sur¬ 
face  of  an  object,  it  will  interact  with  the  material  of 
the  object.  The  nature  of  this  interaction  depends  on 
the  electrical  and  magnetic  properties  of  the  object’s 
material.  For  the  special  case  of  visible  light,  the  rel¬ 
evant  properties  are  called  optical  properties.  In  this 
work,  we  develop  methods  to  predict  information  about 
images  of  a  material  from  the  optical  properties  of  the 
material.

An  important  electrical  property  of  a  material  is  the 
degree  to  which  it  conducts  electricity.  Metals  are  good 
conductors  of  electricity,  while  dielectrics  are  not.  This 
fundamental  difference  causes  metals  and  dielectrics  to 
interact  differently  with  light.  If  we  can  understand 
how  this  difference  in  interaction  causes  correspond¬ 
ing  differences  in  images  of  these  classes  of  materials, 
we  can  derive  methods  to  distinguish  metals  from  di¬ 
electrics  in  images. 

There  are  several  reasons  why  it  is  useful  in  image 
understanding  to  be  able  to  distinguish  metals  from 


dielectrics.  Certainly,  material  identification  is  an  im¬ 
portant  part  of  building  descriptions  of  objects  in  a 
scene.  In  object  recognition  tasks,  material  identifi¬ 
cation  provides  a  valuable  cue  for  recognition.  Aside 
from  this  general  utility,  the  ability  to  distinguish  met¬ 
als  from  dielectrics  is  useful  for  another  reason.  Since 
metals  and  dielectrics  interact  with  light  in  fundamen¬ 
tally  different  ways,  algorithms  have  been  developed  in 
computer  vision  which  apply  to  one  of  these  classes  of 
materials  but  not  to  the  other.  For  example,  certain  vi¬ 
sion  algorithms  require  the  identification  of  a  dielectric 
material in a scene [4], [7]. Other algorithms are better
suited  for  metals,  e.g.  [10].  Before  these  algorithms 
can  be  applied  to  images,  it  is  necessary  to  determine 
if  a  material  in  the  scene  is  a  metal  or  a  dielectric. 

2  Preliminaries 

In  this  section,  we  give  an  overview  of  the  interaction 
of  light  with  matter  and  introduce  much  of  the  termi¬ 
nology  which  will  be  used  in  the  rest  of  this  work. 

2.1  Optically  Homogeneous  and  Inhomoge¬ 
neous  Materials 

It  is  useful  to  divide  materials  into  two  classes  based 
on  their  optical  properties.  This  subsection  defines  op¬ 
tically  homogeneous  and  optically  inhomogeneous  ma¬ 
terials.  The  methods  described  in  this  work  apply  to 
both  kinds  of  materials. 

Optically  homogeneous  materials  have  a  constant 
index  of  refraction  throughout  the  material.  Conse¬ 
quently,  for  an  object  composed  of  a  homogeneous  ma¬ 
terial  in  air,  light  undergoes  reflection  and  refraction 
only  as  it  encounters  a  surface  of  the  object.  Metals 
and  crystals  are  the  most  common  examples  of  homo¬ 
geneous  materials. 

Optically inhomogeneous materials are composed of
a vehicle with many embedded colorant particles which
differ optically from the vehicle. While in the body
of an inhomogeneous material, light typically interacts
with many colorant particles. The net result of many
interactions is that the light is diffused. Some examples
of inhomogeneous materials are plastics, paper,
textiles, and paints.

Figure 1. Specular Reflection

2.2  Reflection 

When  light  is  incident  on  the  surface  of  an  object,  some 
fraction  of  it  is  specularly  reflected.  A  smooth  surface 
reflects  light  only  in  the  direction  such  that  the  angle  of 
incidence  equals  the  angle  of  reflection  (Figure  1).  The 
properties  of  this  specularly  reflected  light  are  deter¬ 
mined  by  the  optical  and  geometric  properties  of  the 
surface.  For  optically  homogeneous  materials  which 
are  not  transparent,  appearance  is  completely  deter¬ 
mined  by  the  properties  of  specularly  reflected  light. 
For  many  inhomogeneous  materials,  such  as  plastics, 
specular  effects  are  also  significant. 

The  fraction  of  the  incident  light  which  is  not  specu¬ 
larly  reflected  enters  the  body  of  the  material.  For  inho¬ 
mogeneous  materials,  the  body  is  composed  of  a  vehicle 
and  many  colorant  particles.  When  light  encounters  a 
colorant  particle,  some  portion  of  it  is  reflected.  After 
many  reflections,  the  light  is  diffused  and  a  significant 
fraction  can  exit  back  through  the  surface  in  a  wide 
range  of  directions  (Figure  2). 

2.3  Irradiance  and  Color 

Figure 2. Colorant Layer Scattering

In computer vision, imaging sensors are typically used to
measure certain properties of light across a plane. Two of
the properties of greatest interest in computer vision are
irradiance and color. Irradiance is a single measure of how
much visible light strikes an area of a sensor plane. It is
defined as power per unit area. Irradiance, therefore,
contains no direct information about the spectral properties
of the light striking the sensor plane, i.e., many different
spectral power distributions will give the same image
irradiance. We define the color of light striking a sensor
plane to be the light's spectral power distribution in the
visible range. For an image of a reflecting surface, image
irradiance and color will depend on the properties of the
light illuminating the surface, the spectral reflectance of
the surface material, and the geometry of reflection. A
detailed discussion of the sensing and representation of
irradiance and color in computer vision is given in [3].

2.4  The  Reflectance  Model 

We quantify the reflectance $R$ of a surface by

$$R(\theta_i, \theta_v, \theta_p, \lambda) = R_s(\theta_i, \theta_v, \theta_p, \lambda) + R_b(\theta_i, \theta_v, \theta_p, \lambda) \tag{1}$$

where $R_s$ is the specular reflectance term and $R_b$ is the
body reflectance term due to colorant layer scattering. The
angles $\theta_i$, $\theta_v$, and $\theta_p$ are the
photometric angles. $\theta_i$ denotes the angle between the
surface normal and the illumination direction. $\theta_v$
denotes the angle between the surface normal and the viewer.
$\theta_p$ denotes the angle between the illumination
direction and the viewing direction. As usual, $\lambda$
denotes wavelength.

The power of the light reflected towards a viewer is given by

$$I(\theta_i, \theta_v, \theta_p, \lambda) = R(\theta_i, \theta_v, \theta_p, \lambda)\, L(\lambda) \tag{2}$$

where $L(\lambda)$ is the spectral power distribution of the
light incident on the surface.
Table 1. Electromagnetic Properties of Matter

Property                  Symbol       Units
Electrical permittivity   $\epsilon$   $\mathrm{C^2/(N \cdot m^2)}$
Magnetic permeability     $\mu$        $\mathrm{N \cdot s^2/C^2}$
Conductivity              $\sigma$     $(\Omega \cdot \mathrm{m})^{-1}$

3  The  Optical  Properties  of  Matter 

In  this  work,  we  treat  light  as  an  electromagnetic  wave. 
While  at  a  fundamental  level  certain  properties  of  light 
are  best  described  in  terms  of  localized  quanta  of  en¬ 
ergy,  the  light  we  measure  in  this  work  transports  rel¬ 
atively  large  amounts  of  energy.  Hence  the  effects  of 
individual  quanta  cannot  be  distinguished.  For  this 
work,  electromagnetic  theory  is  quite  adequate  for  de¬ 
scribing  the  propagation  of  light. 

Given our view of light as an electromagnetic wave, the
characteristics of the interaction between light and material
depend on the material's electrical and magnetic properties.
Those material properties which appear in Maxwell's equations
and thereby determine the interaction are summarized by the
quantities $\epsilon$, $\mu$, $\sigma$ (Table 1).

The magnetic permeability of all materials, with the
exception of ferromagnetic materials, is very nearly equal to
the permeability of free space $\mu_0$. Since ferromagnetic
materials are scarce, we do not consider them in this work.
Consequently, we take $\mu = \mu_0$.

It is important to note that $\epsilon$ and $\sigma$ are not
constants for a given material, but in fact depend on the
wavelength of the light being considered. This dependence on
wavelength will prove to be important when we consider the
spectral properties of the light reflected from a material.

The optical properties of a material are summarized by the
complex index of refraction $M = n - iK_0$ where both $n$ and
$K_0$ are real. The dimensionless quantity $n$ is called the
refractive part of $M$. In a nonattenuating medium
($K_0 = 0$), $n$ is defined as the ratio of the speed of light
in a vacuum to the speed of light in the material. For
absorbing media ($K_0 > 0$), the interpretation of $n$ is more
complicated. $K_0$ is called the absorptive part of $M$ or the
extinction coefficient. If the amplitude of an electromagnetic
wave in a material does not decrease as the wave propagates in
the material, then $K_0 = 0$. In a homogeneous absorbing
material, the amplitude of a light wave is exponentially
attenuated. If a plane wave of angular frequency $\omega$
traveling in the $x$ direction encounters an interface at
$x = 0$, then the irradiance $I(x)$ at a depth $x$ in the
material is given by

$$I(x) = I(0)\, e^{-2 \omega K_0 x / c} \tag{3}$$

where $c$ is the speed of light in a vacuum. The distance
$x = c/(2 \omega K_0)$ is often called the skin depth for a
material. From equation 3, we see that the irradiance
decreases by a factor of $e^{-1}$ as the light wave penetrates
to the skin depth.
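
As a minimal sketch of equation 3 and the skin depth, the following Python functions evaluate both for a given vacuum wavelength. The function names and the sample value $K_0 = 3$ are illustrative assumptions, not measurements from this work; the angular frequency is obtained from the wavelength as $\omega = 2\pi c/\lambda$.

```python
import math

C = 299_792_458.0  # speed of light in a vacuum, m/s


def angular_frequency(wavelength_m):
    """Angular frequency (rad/s) of light with the given vacuum wavelength."""
    return 2.0 * math.pi * C / wavelength_m


def skin_depth(wavelength_m, k0):
    """Depth x = c / (2 * omega * K0) at which irradiance falls by e^-1."""
    return C / (2.0 * angular_frequency(wavelength_m) * k0)


def irradiance_fraction(depth_m, wavelength_m, k0):
    """I(x) / I(0) from equation 3 for an absorbing medium."""
    omega = angular_frequency(wavelength_m)
    return math.exp(-2.0 * omega * k0 * depth_m / C)


if __name__ == "__main__":
    wavelength = 550e-9   # green light, metres
    k0 = 3.0              # hypothetical extinction coefficient
    d = skin_depth(wavelength, k0)
    print(d / wavelength, irradiance_fraction(d, wavelength, k0))
```

With these assumed values the skin depth is about 15 nm, a small fraction of the wavelength, and the irradiance at that depth is $e^{-1}$ of its surface value.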

It is possible to solve for $n$ and $K_0$ in terms of the
fundamental electromagnetic properties of the material by
requiring that the incident light wave satisfies Maxwell's
equations. The resulting expressions for $n$ and $K_0$ are
given by


Since $\epsilon$ and $\sigma$ are, in general, functions of
wavelength, both $n$ and $K_0$ are therefore functions of
wavelength.
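
For reference, the textbook solution for a plane wave in a conducting medium can be written as below. This form is an assumption about the content of equations (4) and (5), stated because it agrees with the rest of the text: it reduces to equations (13) and (14) when $\sigma = 0$ and $\mu = \mu_0$, and it gives $K_0 \to \infty$ as $\sigma \to \infty$. Here $\lambda$ is the wavelength in a vacuum.

```latex
n^2   = \frac{\mu \epsilon c^2}{2}
        \left[ 1 + \sqrt{1 + \left( \frac{\lambda \sigma}{2 \pi c \epsilon} \right)^2} \right]

K_0^2 = \frac{\mu \epsilon c^2}{2}
        \left[ -1 + \sqrt{1 + \left( \frac{\lambda \sigma}{2 \pi c \epsilon} \right)^2} \right]
```

Both expressions follow from requiring $M^2 = (n - iK_0)^2 = \mu \epsilon c^2 - i \mu \sigma c^2 / \omega$ with $\omega = 2 \pi c / \lambda$, so that $n^2 - K_0^2 = \mu \epsilon c^2$ and $2 n K_0 = \mu \sigma c^2 / \omega$.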

Given isotropic homogeneous media, the Fresnel equations
describe the light which is specularly reflected from an
interface in terms of $M$ and geometry. If unpolarized light
is incident at an angle $\theta_i$, then the monochromatic
specular reflectance $R_s$ is

$$R_s = 0.5\,(R_\perp + R_\parallel) \tag{6}$$

where $R_\perp$ is the perpendicular polarized component and
$R_\parallel$ is the parallel polarized component. From
electromagnetic theory, $R_\perp$ and $R_\parallel$ are given
by

$$R_\perp = \frac{a^2 + b^2 - 2a\cos\theta_i + \cos^2\theta_i}{a^2 + b^2 + 2a\cos\theta_i + \cos^2\theta_i} \tag{7}$$

$$R_\parallel = R_\perp\, \frac{a^2 + b^2 - 2a\sin\theta_i\tan\theta_i + \sin^2\theta_i\tan^2\theta_i}{a^2 + b^2 + 2a\sin\theta_i\tan\theta_i + \sin^2\theta_i\tan^2\theta_i} \tag{8}$$

where

$$2a^2 = \sqrt{g^2 + 4 n^2 K_0^2} + g \tag{9}$$

$$2b^2 = \sqrt{g^2 + 4 n^2 K_0^2} - g \tag{10}$$

$$g = n^2 - K_0^2 - \sin^2\theta_i \tag{11}$$

Equations 7 and 8 are known as the Fresnel equations. The
Fresnel equations are derived in many places including [1].

For the case of normal incidence, i.e. $\theta_i = 0^\circ$,
equation (6) simplifies to

$$R_s(\theta_i = 0^\circ) = \frac{(n - 1)^2 + K_0^2}{(n + 1)^2 + K_0^2} \tag{12}$$
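
The Fresnel reflectance for unpolarized light can be sketched in Python as follows. This is a minimal illustration of the standard Fresnel relations for an absorbing medium, which reduce to the normal incidence formula (12); the function name and the sample optical constants are assumptions made for illustration only.

```python
import math


def fresnel_specular_reflectance(n, k0, theta_i_deg):
    """Unpolarized specular reflectance R_s for complex index M = n - i*K0.

    theta_i_deg is the angle of incidence in degrees (0 <= theta < 90).
    """
    theta = math.radians(theta_i_deg)
    s, c = math.sin(theta), math.cos(theta)

    g = n * n - k0 * k0 - s * s
    root = math.sqrt(g * g + 4.0 * n * n * k0 * k0)
    a = math.sqrt(max(root + g, 0.0) / 2.0)   # from 2a^2 = root + g
    b_sq = max(root - g, 0.0) / 2.0           # from 2b^2 = root - g
    ab = a * a + b_sq

    # Perpendicular polarized component.
    r_perp = (ab - 2.0 * a * c + c * c) / (ab + 2.0 * a * c + c * c)

    # Parallel polarized component, expressed relative to r_perp.
    st = s * math.tan(theta)
    r_par = r_perp * (ab - 2.0 * a * st + st * st) / (ab + 2.0 * a * st + st * st)

    return 0.5 * (r_perp + r_par)


if __name__ == "__main__":
    # Dielectric-like constants (n = 1.5, K0 = 0) at normal incidence.
    print(fresnel_specular_reflectance(1.5, 0.0, 0.0))
    # Hypothetical metal-like constants with a large extinction coefficient.
    print(fresnel_specular_reflectance(0.2, 3.4, 0.0))
```

For $n = 1.5$ and $K_0 = 0$ the normal incidence value is $(0.5)^2/(2.5)^2 = 0.04$, while a large $K_0$ drives the reflectance towards one, which is the contrast between dielectrics and metals discussed in the following sections.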


Table 2. Conductivity of Materials

Material       Class            Conductivity, $(\Omega \cdot \mathrm{m})^{-1}$
copper         metal            $5.9 \times 10^{7}$
zinc           metal            $1.7 \times 10^{7}$
iron           metal            $1.0 \times 10^{7}$
uranium        metal            $3.4 \times 10^{6}$
graphite       semi-conductor   $2.0 \times 10^{4}$
germanium      semi-conductor   $10^{1}$ to $10^{3}$
silicon        semi-conductor   $10^{-1}$ to $10^{1}$
glass          ceramic          $10^{-11}$ to $10^{-9}$
diamond        ceramic          $10^{-11}$ to $10^{-10}$
porcelain      ceramic          $10^{-12}$ to $10^{-10}$
mica           ceramic          [illegible]
fused quartz   ceramic          $10^{-16}$ to $10^{-18}$

4  The  Interaction  of  Light  with  Metals 

In this section, we examine the characteristics of the
interaction of light with metals. The most prominent
difference between metals and other materials is the large
number of electrons which are free to move throughout a metal.
Metals, therefore, have large values of the conductivity
$\sigma$. By contrast, electric charges are not free to move
in dielectrics, making these materials poor conductors.
Table 2 shows the large variation of conductivity among
various classes of materials.

Another  important  property  of  metals  is  that  they 
are  optically  homogeneous.  As  a  result,  colorant  layer 
scattering  does  not  occur  for  metals  and  Rb  =  0  in 
equation  (1).  Therefore,  the  only  relevant  reflection 
process  for  metals  is  specular  reflection. 

In  subsections  4.1  and  4.2,  we  consider  respectively 
the  intensity  and  spectral  properties  of  the  light  re¬ 
flected  from  a  metal. 

4.1  Intensity  Properties  of  the  Reflected  Light 

The conductivity $\sigma$ of a material is defined as the
ratio of the current density produced by an electric field to
the intensity of the electric field. In a perfect dielectric,
there are no free electrons and the conductivity is zero. For
metals, on the other hand, there are many free electrons and
the conductivity is much greater than zero (Table 2). In the
hypothetical case of a perfect conductor, we would have
$\sigma = \infty$. Physically this would mean that a stream of
electrons moving in a certain direction would continue to move
in that direction without resistance. In real metals, however,
electrons experience resistance due to departures of the solid
lattice from perfect regularity. These departures from perfect
regularity are due to thermal motion of the atoms and
impurities in the solid. When an electron is scattered by the
imperfect lattice, electromagnetic energy is irreversibly
converted to heat. This process is in part responsible for the
attenuation of the incident wave as described by equation 3.

We begin by approximating metals as perfect conductors, i.e.
we let $\sigma \to \infty$. If we fix $\lambda$ to be some
visible wavelength, then $n$ and $K_0$ are functions of only
$\epsilon$ and $\sigma$. It is not possible to directly
measure $\epsilon$ for metals. We can assume from the relevant
physics, however, that $\epsilon$ for metals is the same order
of magnitude as for dielectrics [1]. Therefore, from
equation 5, we have $K_0 \to \infty$ as $\sigma \to \infty$.
From the Fresnel equations, this tells us that $R_s \to 1$ as
$\sigma \to \infty$.

Although the conductivity of real metals is finite, the
approximation $R_s(\theta_i, \lambda) \approx 1$ is reasonable
for many metals in the visible wavelength range. At the
physical level, this is because the many free electrons
present in a metal are excellent scatterers of light. Thus
incident light has little opportunity to penetrate into the
metal. Consequently, $K_0$ is large and metals have a skin
depth which is only a small fraction of a wavelength. Few
electrons are able to absorb any energy from the incident
light, causing most of the incident light to be reflected.

4.2  Spectral  Properties  of  the  Reflected  Light 

In this section we investigate certain physical properties of
metals and explain how these properties determine the
structure of a metal's surface spectral reflectance function.
We show from first principles that the surface spectral
reflectance functions for metals are highly constrained. This
is of great use in material identification since it allows a
system to readily decide from a material's estimated spectral
reflectance if that material could possibly be a metal.

The  optical  properties  of  a  metal  are  largely  deter¬ 
mined  by  how  light  interacts  with  the  electrons  in  the 
metal.  We  can  divide  the  various  possible  interactions 
into  two  classes.  In  the  first  class  of  interaction,  light 
is  scattered  by  a  free  conduction  electron.  This  class 
of  interaction  is  easily  the  most  common  in  metals  be¬ 
cause  of  their  large  number  of  free  electrons.  In  the  sec¬ 
ond  class  of  interaction,  the  light’s  energy  is  absorbed 
by  the  metal  as  an  electron  jumps  to  a  higher  energy 
level.  We  examine  in  more  detail  the  properties  of  these 
two  classes  of  interaction  in  the  following  paragraphs. 

The most prominent process which contributes to the optical
properties of a metal is the interaction of light with free
conduction electrons. As explained in 4.1, these free
electrons scatter light very effectively, causing metals to
have a small skin depth and large reflectance values. Since
these free electrons have no natural frequencies, they scatter
light of all wavelengths equally well. This is why most metals
(e.g. aluminum, silver, sodium) have spectral reflectance
functions which do not vary with wavelength in the visible
range and thus appear silver or gray in color.

In addition to being scattered by free electrons, light can
also be absorbed during interactions with electrons which are
confined to exist in certain energy zones (called Brillouin
zones). Each zone is separated from adjacent zones by a gap of
disallowed energies. For a certain range of incident light
wavelengths, the energy of the incident light will match the
energy gap between zones. In this situation, an atom will
absorb the light allowing an electron to jump to a higher
energy zone. This absorbed energy is typically converted
rapidly into heat. The important point is that there is a
critical wavelength $\lambda_0$ such that incident light
energy is not absorbed for $\lambda > \lambda_0$ and for
decreasing $\lambda$ there is a dramatic increase in the
amount of incident light energy absorbed at $\lambda_0$. This
absorbed light energy cannot appear in the reflected wave. For
decreasing wavelength, therefore, $n$ and $K_0$ behave in such
a way that there is a dramatic decrease in spectral
reflectance at $\lambda_0$.

For most metals, there is no absorption of light in the
visible wavelength range. Thus most metals have spectral
reflectance functions which are constant with respect to
wavelength in the visible. The most notable exceptions to this
rule are copper and gold, both of which have absorption bands
in the visible [9]. The spectral reflectance functions for
both copper and gold are near one for long visible wavelengths
and decrease sharply below their respective critical
wavelengths $\lambda_0$. Figure 3 shows the spectral
reflectance functions for copper and gold. Copper and gold
strongly absorb blue and green light while reflecting red
light. This gives these metals their reddish colors. Other
metals have absorption bands which do not lie in the visible.
Silver, for example, begins absorbing in the ultra-violet at
about $\lambda_0 = 310$ nm.

Figure 3. Spectral Reflectance of Copper and Gold

We have shown that the physical properties of metals provide
strong constraints on their spectral reflectance functions.
The scattering of light by free conduction electrons causes
most metals to have spectral reflectance functions which are
constant over the visible wavelengths (e.g. silver, aluminum,
sodium, tin, potassium, niobium, scandium, cesium, yttrium,
vanadium, osmium). For other metals, absorption bands in the
visible give rise to critical wavelengths $\lambda_0$ such
that the metal reflects strongly for $\lambda > \lambda_0$ and
absorbs strongly for $\lambda < \lambda_0$ (e.g. copper,
gold). Note that these are the only possible kinds of spectral
reflectance functions for a metal. We would not expect, for
example, to see metals with spectral reflectance functions
that decrease sharply with increasing wavelength. This is why
there are no metals that appear green or blue. Consequently,
it is often easy to decide from a spectral reflectance
function that a material is not a metal.
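
The constraint above suggests a simple rejection test, sketched here in Python. This is an illustrative assumption rather than the technique developed in this work: the function name, the sampled representation of the reflectance function, and the tolerance value are all hypothetical. The test only rules a material out; passing it does not establish that the material is a metal.

```python
def could_be_metal(reflectance, tolerance=0.05):
    """Return False if the sampled reflectance drops with increasing wavelength.

    reflectance: values in [0, 1], ordered by increasing wavelength.
    tolerance:   hypothetical allowance for measurement noise.
    A metal is expected to be flat, or to rise once past a critical wavelength.
    """
    highest_so_far = float("-inf")
    for value in reflectance:
        if value < highest_so_far - tolerance:
            return False  # reflectance decreased as wavelength increased
        highest_so_far = max(highest_so_far, value)
    return True


if __name__ == "__main__":
    flat = [0.90, 0.91, 0.90, 0.92]        # silver-like, constant
    step_up = [0.40, 0.45, 0.85, 0.95]     # copper-like, rises at long wavelengths
    falling = [0.80, 0.60, 0.30, 0.10]     # bluish material
    print(could_be_metal(flat), could_be_metal(step_up), could_be_metal(falling))
```

Comparing each sample against the running maximum, rather than only against its neighbor, keeps a slow cumulative decrease from slipping under the tolerance.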

5  Light  and  Dielectrics 

In  this  section,  we  examine  the  characteristics  of  the 
interaction  of  light  with  dielectric  materials.  The  most 
significant  difference  between  dielectrics  and  metals  is 
that  dielectrics  have  very  small  conductivities  com¬ 
pared  to  metals  (Table  2).  Physically  this  occurs  be¬ 
cause  in  a  dielectric  all  Brillouin  zones  which  contain 
any  electrons  are  full  and  there  exist  large  energy  gaps 
separating  these  full  zones  from  adjacent  empty  zones. 
Hence  an  applied  electric  field  does  not  produce  a  cur¬ 
rent  and  the  material  is  an  insulator. 

Another  important  difference  between  dielectrics  and 
metals  is  that  many  dielectric  materials  are  optically 
inhomogeneous.  Therefore,  for  most  dielectrics  both 
specular  reflection  and  colorant  layer  scattering  are  im¬ 
portant  optical  processes. 

5.1  Homogeneous  Dielectrics 

In this subsection, we consider the properties of a
homogeneous material which is a perfect dielectric. For a
perfect dielectric, the conductivity $\sigma$ is zero.
Therefore, from (4) and (5) we have

$$n = \sqrt{\mu_0 \epsilon c^2} \tag{13}$$

$$K_0 = 0 \tag{14}$$

For homogeneous dielectrics, $n$ defined by (13) is equal to
the ratio of the speed of light in a vacuum to the speed of
light in the material. Since $K_0$ is zero in (14), we see
that a homogeneous dielectric medium does not attenuate the
incident light. The normal incidence Fresnel reflectance for
such a material is

$$R_s(\theta_i = 0^\circ) = \frac{(n - 1)^2}{(n + 1)^2} \tag{15}$$

For the values of $n$ corresponding to most dielectrics, $R_s$
is small. In particular, $R_s$ for metals is typically much
larger than $R_s$ for dielectrics.

5.2  Inhomogeneous  Dielectrics 

Most dielectric materials, with the exception of crystals, are
optically inhomogeneous. Some examples of inhomogeneous
dielectric materials are plastics, paints, textiles, and
ceramics. As explained in 2.2, the body of an inhomogeneous
material consists of a vehicle and many embedded colorant
particles. The refractive index $n$ of the vehicle determines
the specular reflectance of the material from the Fresnel
relations (e.g. as in equation 15). For most vehicle
materials, $n$ is only weakly dependent on $\lambda$ [12]. The
fraction of the incident light which enters the body of the
inhomogeneous dielectric interacts with many colorant
particles. These colorant particles selectively absorb certain
wavelengths and thereby change the spectral properties of the
incident light. The colorant particles also serve to diffuse
the incident light and much of it is scattered out of the body
of the material in many directions. The modified Kubelka-Munk
theory [8], [11] provides a model for the scattering of light
from inhomogeneous dielectrics. The use of this model is
discussed in [5].

6  Identifying  Metals  and  Dielectrics 

In sections 4 and 5, we showed that there are significant
differences in the ways in which metals and dielectrics
interact with light. In this section, we summarize these
differences. We are currently developing a [...] technique
based on these differences [...] which can be used to identify
metals and dielectrics in images.

In 4, we showed that metals reflect almost all incident light
in nearly a single direction. We also showed that the spectral
reflectance functions for metals are highly constrained. There
is no colorant layer scattering for metals. Thus, for a single
spectral power distribution of illumination, the color of
light reflected from a metal will be nearly constant with only
slight changes due to changing geometry. Measurements are
given in [2] demonstrating this property.

In section 5, we considered the optical properties of
dielectrics. For inhomogeneous dielectrics, colorant layer
scattering is an important optical process. Dielectrics
typically have a small specular reflectance component. This
specular reflectance component usually depends only weakly on
wavelength. The light scattered from the body of a dielectric
is distributed over a wide range of angles and is generally of
a different color than the specularly reflected light.
Measurements depicting this color shift are given in [4] and
[7]. Another distinguishing property of dielectrics is that
their body spectral reflectance functions can take on much
more general forms than the spectral reflectance functions for
metals.

In  this  work,  we  have  considered  primarily  clean, 
smooth  surfaces.  We  have  not  taken  into  account  the 
complicating  effects  of  surface  roughness  and  surface 
impurities  (e.g.  oxides  on  metals)  which  can  affect  the 
properties  of  the  light  scattered  by  a  surface.  The  ef¬ 
fects  of  both  roughness  and  impurities  are  examined  in 
[6].  We  believe,  however,  that  the  fundamental  nature 
of  the  differences  between  metals  and  dielectrics  will 
allow  them  to  be  distinguished  in  images  despite  the 
effects  of  roughness  and  impurities. 
Abstract


Texture  segmentation  may  be  separated  into  two  phases: 
edge  detection,  followed  by  edge  localization;  this  paper 
focuses  on  the  latter  topic.  In  the  first  part  of  this  paper,  we 
use  some  simple,  abstracted  textures  to  illustrate  the  issues 
involved  in  texture  edge  localization.  We  discuss  some  of 
the  limits  of  localization.  The  second  part  suggests  a  simple 
approach  to  localization  based  on  interpreting  the  responses 
of  edge  detectors  spaced  at  fixed  intervals  in  position  and 
orientation  space.  The  results  of  this  algorithm  as  applied 
to  some  synthetic  and  natural  textures  are  presented. 


1  Introduction 


Texture  segmentation  seeks  to  locate  image  curves  that  sep¬ 
arate  different  textures.  The  segmentation  problem  may  be 
separated  into  two  phases:  detection  of  texture  differences 
between  regions,  and  localization  of  the  edges  thus  detected. 
The  current  paper  focuses  on  the  latter  subject,  edge  local¬ 
ization. 

We  briefly  review  the  current  approach  to  texture  anal¬ 
ysis;  the  reader  is  encouraged  to  refer  to  [10]  and  [11]  for 
a  more  complete  discussion.  We  define  textures  by  statisti¬ 
cal  distributions,  and  define  texture  edges  by  differences  in 
distributions  between  two  regions.  Edge  detection  amounts 
to  estimating  distributions  over  two  regions  and  comparing 
the  estimated  parameters.  We  use  the  statistical  signifi¬ 
cance  of  the  difference  between  parameters  as  a  measure  of 
the  difference  between  two  textures.  As  a  simple  example, 
if  a  texture  is  composed  of  pixels  with  normally-distributed 
intensity,  the  distribution  is  defined  by  its  mean  and  vari¬ 
ance,  which  may  be  estimated  over  a  region.  Statistical 
techniques  can  be  used  to  compute  the  probability  of  a 
mean/variance  difference  as  large  as  observed,  under  the 
null  hypothesis  that  the  underlying  distributions  are  iden¬ 
tical  in  the  two  regions. 

Texture discrimination requires making assumptions about the
nature of the distributions from which the textures arise. If
the domain is controlled (e.g., a factory), assumptions can be
based on scene knowledge. When faced with images of unknown
scenes, one must make general assumptions that allow one to
discriminate a large number of different textures. From such
assumptions, one derives a set of texture features to measure
from the image; discrimination is based on comparisons of such
texture measures, and the textures discriminable by a vision
system depend on these measures. A set of texture measures
that allows reliable discrimination of natural textures is
discussed in [11]; the demonstrations discussed below use
those measures.

The  current  approach  to  texture  segmentation  is  based 
on  computing  the  “strength”  of  a  texture  edge  at  many 
positions  in  the  image,  and  then  interpreting  the  resulting 
array  of  edge  strengths  to  localize  the  edges.  Edge  oper¬ 
ators  that  compute  the  statistical  significance  of  a  texture 
boundary  for  a  hypothesized  edge  between  two  regions  are 
discussed  at  length  in  [10].  Generally,  the  maximum  edge 
significance  occurs  when  the  hypothesized  edge  coincides 
with  the  underlying  edge.  However,  the  edge  significance 
is  high  when  the  hypothesized  edge  is  near  the  underlying 
edge  as  well.  The  problem  of  edge  localization  is  to  identify 
the  image  locations  that  provide  the  best  estimate  of  the 
underlying  edges. 

Edge  localization  is  a  difficult  problem.  In  simple  images, 
edges  may  be  well  defined  and  easily  localized,  and  in  these 
cases  it  is  reasonable  to  design  perceptually  valid  localiza¬ 
tion  algorithms.  For  example,  the  edges  in  a  blocks-world 
scene  are  typically  well  defined  and  can  easily  be  identified 
by  a  human  observer.  However,  in  more  complex  scenes, 
such  as  natural  scenes,  the  dominant  edges  may  be  very 
difficult  to  identify.  The  scale  problem  is  particularly  diffi¬ 
cult:  edges  exist  at  many  scales,  and  the  appropriate  scale 
of  analysis  must  be  determined  by  the  application.  A  natu¬ 
ral  scene  may  contain  boundaries  between  a  forest  and  the 
sky,  between  individual  trees  in  the  forest,  and  edges  around 
the  leaves  within  the  trees;  depending  on  the  application, 
one  or  more  scales  of  edge  may  be  appropriate. 

In  this  paper,  we  first  examine  some  simple,  abstracted 
texture  images  that  illustrate  many  of  the  issues  involved 
in  edge  localization.  We  discuss  the  limits  of  localization  in 
such  textures.  The  remainder  of  the  paper  focuses  on  one 
simple  approach  to  localization.  This  approach  is  based  on 
computing  edge  significances  at  many  locations  and  orien¬ 
tations  in  the  image,  and  finding  patterns  in  the  resulting 
edge  significance  arrays  that  correspond  to  image  texture 
edges. The edge localization algorithm is demonstrated on
a natural textured image.


2 Edge Localization Limits


Texture edge localization is limited by the information
present in the image. Estimates of texture edge location are
based on comparisons of estimated parameters of texture
distributions. In one important type of texture, the relevant
texture features are attributes (such as length or
orientation) of image structures. Since these image structures
are sparse in the image, texture information is sparse: where
there are no structures, there is no information. As a result,
localization accuracy increases with the density of image
structures.

For the purpose of exposition in the following discussion, let
us consider the abstracted texture image depicted in
Figure 1a. This texture consists of a two-dimensional array of
uniformly-placed elements with one attribute; the attribute
may be brightness, orientation, etc., but in the figure, it is
represented as size. (In the case of general textures, the
elements have many attributes rather than just one, and the
elements are not well delineated in the image.) The array is
divided into two regions, the left hand side (LHS) and right
hand side (RHS)¹; the element attributes on either side are
chosen from different normal distributions. These 2D, uniform
textures can be characterized by their means and variances:
$\mu_L$ and $\sigma_L^2$ on the LHS, and $\mu_R$ and
$\sigma_R^2$ on the RHS. We shall say that the textures differ
when their means differ, i.e., when $\mu_L \neq \mu_R$.

In this highly simplified domain, texture segmentation
involves locating the curve that separates the two regions.
Imagine a curve $C$ that passes through the texture from top
to bottom. For each such curve we can measure the sample means
($\bar{x}_L$ and $\bar{x}_R$) and variances
($\hat{\sigma}_L^2$ and $\hat{\sigma}_R^2$) on either side of
$C$, and compute the statistical significance of the edge
across $C$ by testing the hypothesis $H_0 : \mu_L = \mu_R$.
More precisely, a significance test computes the probability
$P(D \mid H_0)$, which decreases as $D$, the observed
difference in sample means, increases. It is more natural to
use as a difference measure the significance of the
difference, $A(C) = -\log P(D \mid H_0)$, which is always
positive and grows with the significance of the edge².
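
The significance measure $A(C)$ can be sketched in Python as follows. The text does not specify which significance test is used, so this sketch assumes a two-sided normal-approximation test on the difference of sample means; the function name and the sample data are hypothetical.

```python
import math
from statistics import mean, variance


def edge_significance(left, right):
    """A(C) = -log P(D | H0) for attribute samples on either side of a curve.

    Assumes a two-sided z-test on the difference of sample means, which is
    only one possible choice of significance test.
    """
    d = abs(mean(left) - mean(right))
    standard_error = math.sqrt(variance(left) / len(left)
                               + variance(right) / len(right))
    if standard_error == 0.0:
        return 0.0 if d == 0.0 else math.inf
    z = d / standard_error
    p = math.erfc(z / math.sqrt(2.0))  # P(|Z| >= z) under H0
    return math.inf if p == 0.0 else -math.log(p)


if __name__ == "__main__":
    left = [2.0, 3.0, 1.0, 2.0]
    right = [4.0, 5.0, 3.0, 4.0]
    print(edge_significance(left, right))
    # The same difference in means is more significant with more samples.
    print(edge_significance(left * 4, right * 4))
```

Because the standard error shrinks as the number of samples grows, the same difference in means yields a larger $A(C)$ over larger regions, which is the property that distinguishes this measure from a purely variance-weighted distance.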

In image segmentation, we generally wish to find the
“simplest” curve through the image, choosing a linear
boundary if possible. Suppose we attach to a curve C a
measure of simplicity S(C) that is large when C is linear
and decreases as C becomes more curved or jagged. (One
ad hoc measure of curve simplicity is the ratio of
endpoint-to-endpoint length to total arc length; this is unity
for lines and decreases with the jaggedness or curviness of
the curve.) Our criteria for the choice of a boundary curve C
depend on some combination of its simplicity S(C) and the
significance A(C) of the edge across it. The edge localization
problem amounts to maximizing this combined quality measure
over all C. In general, several curves may satisfy the
“goodness” criteria equally well; this is one source of
uncertainty in localization. In practice, of course, it is
impossible to test all curves passing through the image; they
must be constrained in some way. One solution is to consider
only piecewise linear curves, or ‘edgels’, at fixed positions
and orientations in the image. Since linear C have identical
simplicity S(C), the problem of measuring curve simplicity
is largely avoided. Measuring S(C) in a well-motivated
manner remains an open problem.

1 Imagine two textures drawn on paper; cut a boundary in
one texture and place it on the other. We wish to locate the
boundary.

2 This method of measuring the distance between texture
regions is similar to the Mahalanobis distance [9], which is a
variance-weighted measurement of difference between means.
However, the Mahalanobis distance fails to consider the number
of samples used to estimate the mean and variance; the
statistical significance of any given difference in means grows
with the sample size.

Figure 1: (a) An abstracted texture. Element attributes
are chosen from zero-variance distributions. (b) An underlying
straight edge can be anywhere between the convex
hulls of the two regions. (c) If the edge can be an arbitrary
curve, these two curves are among the possibilities.
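
The ad hoc simplicity measure described above (the ratio of endpoint-to-endpoint length to total arc length) can be sketched for a curve given as a polyline. This is only an illustration: the function name `curve_simplicity` and the polyline representation are assumptions, not part of the original method.

```python
import math


def curve_simplicity(points):
    """Ratio of endpoint-to-endpoint distance to total arc length.

    `points` is a list of (x, y) vertices of a polyline (an assumed
    representation). The ratio is 1 for a straight line and falls
    toward 0 as the curve becomes more jagged or curved.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    arc = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    if arc == 0.0:
        raise ValueError("degenerate curve of zero length")
    return math.dist(points[0], points[-1]) / arc


straight = [(0, 0), (1, 0), (2, 0)]
zigzag = [(0, 0), (1, 1), (2, 0)]
print(curve_simplicity(straight), curve_simplicity(zigzag))
```

The straight polyline scores higher than the zigzag, as the text requires of any simplicity measure.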

2.1  Zero  variance  textures 

Consider first the case in which the attribute variance in
both LHS and RHS is zero, as in Figure 1a. In this situation,
it is possible to classify elements based on their attribute
value, and place the boundary between the clusters
thus formed. However, it is rarely possible to localize the
boundary exactly. In the simplest case, the boundary is
known to be linear, and is then restricted to fall between
the convex hulls bounding the two regions, as shown in
Figure 1b.

Figure 2: Elements from distributions with nonzero variance.
Means on left and right are 2 and 4; both have variance 1.


When the boundary is not known to be linear, it is no
longer restricted to fall between the convex hulls of the
regions, but may be any of infinitely many curves separating
the two textures; two such curves are shown in Figure 1c.
Again, we generally wish to select the “simplest” of these
curves, choosing a linear or piecewise linear boundary if
possible; this choice depends on the measure of simplicity
S(C).

2.2  Nonzero  variance  textures 

Next,  consider  the  case  where  elements  have  nonzero  vari¬ 
ance,  as  shown  in  Figure  2.  Now  the  elements  cannot  be 
classified  with  certainty,  and  it  is  no  longer  possible  to  re¬ 
strict  the  boundary  to  the  region  between  the  clusters. 

Suppose the edge is known to be linear and vertical, and
we need only find its horizontal (x) position. Since the
shape of the boundary curve C is constant, its simplicity
S(C) is constant, and the localization problem reduces to
maximizing the texture difference A(C) across the curve.
Denoting by $C_x$ the curve at position x, the straightforward
way to localize the boundary is to calculate the edge
significance $A(C_x)$ at several positions, with the best estimate
for boundary location corresponding to the x that maximizes
$A(C_x)$. In the case of zero variance (Figure 1), $A(C_x)$ is
infinite when x is in the region between the two clusters; as
x moves to the right of this region, the sample means grow
closer, the left-side sample variance increases rapidly, and
$A(C_x)$ decreases monotonically. When the variances are
nonzero, there is still a peak in significance near the
underlying edge, but due to random variation this peak may not
coincide with the true edge, and the significance may not fall
off monotonically on either side. Notice that $A(C_x)$ is not
continuous in x, but is instead a step function, since $A(C_x)$
changes only when x crosses an image element. As a result,
the x that maximizes $A(C_x)$ can only be restricted to lie in
an interval whose size is proportional to the average
separation between elements. That is, localization accuracy
increases with texture element density.
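
The one-dimensional search described above can be sketched as follows. The significance measure here is an assumption: a Welch-style statistic (difference of sample means divided by its standard error) stands in for the texture significance measure of [11], which is not reproduced in this paper. The names `significance` and `localize_vertical_edge` are hypothetical.

```python
import math
import random
import statistics


def significance(left, right):
    """Difference of sample means divided by its standard error.

    An assumed stand-in for the edge significance A(C_x); it grows
    with sample size, unlike the Mahalanobis distance.
    """
    diff = abs(statistics.fmean(left) - statistics.fmean(right))
    se2 = (statistics.variance(left) / len(left)
           + statistics.variance(right) / len(right))
    if se2 == 0.0:
        # Zero-variance textures: any nonzero difference is infinitely significant.
        return math.inf if diff > 0.0 else 0.0
    return diff / math.sqrt(se2)


def localize_vertical_edge(elements, min_side=2):
    """Scan candidate positions of a vertical edge; return (x, significance).

    `elements` is a list of (x, attribute) pairs. The significance only
    changes when the edge crosses an element, so one candidate per gap
    between adjacent elements suffices; the gap midpoint is reported.
    """
    elements = sorted(elements)
    best_x, best_sig = None, -1.0
    for k in range(min_side, len(elements) - min_side + 1):
        left = [a for _, a in elements[:k]]
        right = [a for _, a in elements[k:]]
        sig = significance(left, right)
        if sig > best_sig:
            best_x = 0.5 * (elements[k - 1][0] + elements[k][0])
            best_sig = sig
    return best_x, best_sig


# Elements as in Figure 2: means 2 and 4, variance 1, true edge at x = 5.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
noisy = [(x, rng.gauss(2.0 if x < 5.0 else 4.0, 1.0)) for x in xs]
print(localize_vertical_edge(noisy))
```

Because the statistic is a step function of x, the estimate can be no more precise than the spacing between neighbouring elements, which is the density dependence noted in the text.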

When both the orientation and position of the boundary
are unknown, the best estimate of the texture boundary
corresponds to the (x, θ) edge parameters that maximize the
edge significance; one must search over the two-dimensional
x and θ space. This is of course a much more complex problem
than that of localizing in x alone. In general, there is
an unknown number of image boundaries rather than just
one. In this situation, there is no single “best estimate” for
edge location, since when attributes have nonzero variance,
A(C) may be significant for C falling between many
nonoverlapping regions. The scale problem is one result of
unknown boundary number and size. Figure 3 shows a simple
example: at a large scale, the boundary at A is significant
since it separates elements with small variance and widely
different mean. On the other hand, the boundaries at B,
surrounding the clusters of elements with zero variance, are
highly significant if the regions over which distributions are
measured are appropriately small. Depending on the
application, it might be appropriate to detect the small- or
large-scale boundaries, or both.

Figure 3: Illustration of one aspect of the scale problem.
There is a large scale boundary at A, as well as small scale
boundaries at B around the clusters of elements with zero
variance.

2.3  Direct  estimation  of  edge  location 

The problem of estimating the boundary position from the
sampled attribute values is similar to estimating the mean
of n samples $u_i$ of a random variable U. From the $u_i$, one
may directly compute a statistic, the sample mean, that
estimates the mean; one may also compute the variance of
this statistic, and derive a confidence interval for the mean.
In texture edge localization, we wish to estimate the position
of the underlying texture boundary and an associated
variance of the estimate (cf. [3]). The iterative procedure
described above for estimating the edge location x corresponds
to estimating the true mean μ of n samples by calculating
the significance of the hypothesis μ = c for various
c, and choosing the c that maximizes the significance. It
would be desirable to determine directly an estimate for
edge location and its variance.

Unfortunately, there seem to exist no methods for this
direct estimation. Consider the extremely simple
one-dimensional case of Figure 2, that is, a set of random
variables $(X_1, \ldots, X_T)$, where $X_i$ follows distribution F for
$i = 1, \ldots, t$ and distribution G for $i = t + 1, \ldots, T$. The
problem of determining t, the point at which the distribution
changes, is known in statistics as the change-point
problem, and has been extensively studied (e.g., [5,6,12,2]).
The results relevant to our purposes may be summarized as:
no explicit form exists for calculating the estimated change
point $\hat{t}$. Instead, $\hat{t}$ must be computed sequentially, by
testing the significance of the difference between estimated
distributions on either side of conditional change-points t; the
estimate $\hat{t}$ is that value of t which maximizes the
significance. In addition, estimating the distribution of $\hat{t}$ is a
nearly intractable problem, even asymptotically [12]. For
normally distributed $X_i$ with means $\mu_1$ and $\mu_2$ and common
variance $\sigma^2$, Hinkley [5] shows that the variance of $\hat{t}$
depends asymptotically on $\Delta = |\mu_1 - \mu_2| / 2\sigma$. When the
difference in means $\mu_1 - \mu_2$ is large, $P(\hat{t} = t) \approx [\Phi(\Delta)]^2$,
where $\Phi$ denotes the standard normal integral; thus the
probability of getting the “right answer” increases with the
difference between means, as one would expect.

3  A  Localization  Algorithm 

The  remainder  of  this  paper  discusses  one  simple  approach 
to  edge  localization.  This  method  should  be  regarded  as  an 
ad  hoc  approach  whose  intent  is  simply  to  demonstrate  that 
the  texture  measures  described  in  [11]  suffice  to  segment  real 
textured  images  in  a  reasonable  manner. 

The approach taken here involves calculating A for a
fixed set of edge operators densely placed in x, y, θ space,
and interpreting the pattern of responses. Our goal is to
locate short linear edge segments, or ‘edgels’ (cf. [7]) that
approximate the image edges. We choose an edgel length
L, and assume that image edges have length at least L;
shorter edges may be ignored. The input to the localization
algorithm is a set of edge significance arrays $E_\theta(x, y)$,
which contain at each point (x, y) the significance of a
texture edge centered at (x, y) with orientation θ and length L.
Each edge significance array $E_\theta$ contains information about
the edges in directions near θ. The edge operator used here
is described in [10] and depicted in Figure 4; the texture
measures are described in [11]. For convenience, the width
W of the edge operators is taken to be L/2, so the operator
is square.

A texture edge in the image produces a ridge-shaped pattern
in the edge significance array; the algorithm interprets
these characteristic patterns as edges. The ridge shape
results from the fact that the response of an edge operator
generally is maximum when its parameters (x, y, θ) correspond
with those of the underlying edge, and drops off as
its position or orientation deviates from that of the edge. An
edge operator is sensitive to edges when the orientations of
the edge and the edge operator are sufficiently close (within
a few degrees); the orientation of the resulting ridge follows
the orientation of the underlying edge. Figure 5a shows an
image with a horizontal edge, at 0 deg, and an oblique edge
at −30 deg. An intensity edge operator with orientation
θ = −20 deg was applied to this image; Figure 5b shows the
significance of the edge at positions near the corner. Notice
that the edge operator response forms a ridge whose
orientation is −30 deg, corresponding to the oblique edge.
The operator responds weakly to the horizontal edge, since
their orientations are quite different. Observe that the
significance remains high for positions aligned with the edge
but beyond its termination. Because the edge detector sums
slices longitudinally, it is insensitive to terminations or
corners of edges. We shall briefly discuss a method for dealing
with corners below.

Figure 4: An edge operator. It divides its receptive field
into N slices (here, N = 4) on either side of a hypothesized
edge, as shown.

We  detect  ridges  by  searching  for  them  in  each  of  the 
edge  significance  arrays.  Such  ridges  generally  correspond 
to  image  edges.  The  union  of  these  ridges  over  all  orienta¬ 
tions  provides  a  first  approximation  to  the  estimated  image 
edges. Figure 6 depicts this phase of the system's operation.
Further processing is necessary to
remove the portions of edges that extend beyond corners.

3.1  Ridge  detection 

To detect ridges, we fit a suitable function to small windows
of $E_\theta$. The significance of an edge typically drops off rapidly
with distance from the underlying edge, falling to near zero,
then rising to a low noise level. The minimum point generally
occurs at about half the width of the edge operator.
Figure 7 shows an example from a synthetic image with an
intensity edge similar to that in Figure 5.

We use a quadratic function to describe the ridge shape.
For a ridge centered at $(x_0, y_0)$ with orientation φ, an
appropriate functional form is

$$E_\theta(x, y) = a\,p^2(x, y) + b\,p(x, y) + c, \tag{1}$$

where p(x, y) is the perpendicular distance from the pixel
at (x, y) to the line centered at $(x_0, y_0)$ on the ridge (see
Figure 8), and a, b and c are the fitted parameters. The
distance p(x, y) is

$$p(x, y) = |(x - x_0)\sin\phi - (y - y_0)\cos\phi|. \tag{2}$$

We require a > 0 so that the function is concave up.
Parameter c measures the height of the ridge at its center,
while a and b determine its steepness. Given a region of
$E_\theta$ and a hypothesized position and orientation for the ridge,
we wish to minimize the error of the data from the fitted
function. Standard least-squares regression methods are used
to find the coefficients a, b and c, and are not detailed here.

Figure 7: Intensity edge significance as a function of
perpendicular distance from the underlying edge (horizontal
axis: distance from edge). Image parameters: $\mu_L = 0.3$,
$\mu_R = 0.5$, $\sigma_L = \sigma_R = 0.15$. Operator parameters: L = 40,
W = 20, N = 5. Averaged over 10 images.

Figure 5: (a) An image containing an intensity edge with
a 30-degree corner. Pixel intensity means are 0.3 on bottom,
0.6 on top; variance on either side is 0.15 (range 0 to
1). (b) Edge significances for an intensity edge operator of
length 20, half-width 10, orientation −20 degrees. Bright
pixels correspond to high significance.

Figure 6: Illustration of localization phase of system.

Figure 8: A ridge detector operator with length L, width
W and orientation φ.
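
Since the regression is not detailed in the text, the following is a minimal sketch of one way to fit equation (1), assuming ordinary least squares solved through the 3×3 normal equations. The function names, the `E[y][x]` indexing, and the use of radians are assumptions for illustration; the sketch does not itself enforce a > 0, so a caller would reject fits that violate it.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_ridge(E, x0, y0, phi, L, W):
    """Least-squares fit of E ~ a*p^2 + b*p + c in a rotated window.

    The window is centered at (x0, y0), has length L along orientation
    phi (radians) and half-width W across it. Returns (a, b, c, error),
    where error is the sum of squared residuals.
    """
    s, co = math.sin(phi), math.cos(phi)
    samples = []
    for y, row in enumerate(E):
        for x, value in enumerate(row):
            dx, dy = x - x0, y - y0
            along = dx * co + dy * s          # distance along the ridge
            p = abs(dx * s - dy * co)         # equation (2)
            if abs(along) <= L / 2 and p <= W:
                samples.append((p, value))
    # Normal equations for the unknowns (a, b, c).
    S = [sum(p ** k for p, _ in samples) for k in range(5)]
    T = [sum(e * p ** k for p, e in samples) for k in range(3)]
    A = [[S[4], S[3], S[2]], [S[3], S[2], S[1]], [S[2], S[1], S[0]]]
    rhs = [T[2], T[1], T[0]]
    d = det3(A)
    if abs(d) < 1e-12:
        raise ValueError("window too small or degenerate for a quadratic fit")
    coeffs = []
    for j in range(3):                        # Cramer's rule, column by column
        Aj = [row[:] for row in A]
        for i in range(3):
            Aj[i][j] = rhs[i]
        coeffs.append(det3(Aj) / d)
    a, b, c = coeffs
    err = sum((a * p * p + b * p + c - e) ** 2 for p, e in samples)
    return a, b, c, err


# Synthetic horizontal ridge through y = 10 with known coefficients.
E = [[0.05 * abs(y - 10) ** 2 - 1.0 * abs(y - 10) + 5.0 for x in range(21)]
     for y in range(21)]
print(fit_ridge(E, 10, 10, 0.0, L=10, W=5))
```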

The ridge detection algorithm searches each array $E_\theta$ for
ridges whose height c is over a threshold. The orientation
of such ridges is found iteratively by searching for the
orientation that results in the minimum error of fit. Since
$E_\theta$ contains information for edges near θ, the orientation
search can be constrained to a few degrees in either direction
from θ. The details of the ridge detection algorithm
follow. Suppose the orientation difference between edge
significance planes is α, typically about 30 deg.

1. (Find local maxima) Search through $E_\theta$ for local
(8-neighbor) maxima $E_\theta(x, y)$ such that $E_\theta(x, y) > T_1$,
where $T_1$ is a threshold. Repeat the following steps
for each such maximum location $(x_0, y_0)$.

2. (Find best ridge fit) Determine the ridge-function
parameters a, b and c for ridges centered at $(x_0, y_0)$, for
several orientations φ ranging from θ − α/2 to θ + α/2.
For each such φ, the fitted function considers points
$E_\theta(u, v)$ that lie inside a rectangular window with
half-width W, length L, and orientation φ, as shown
in Figure 8; L is the same as the length of the edge
operator used to create the edge significance arrays,
and W is half the width of the edge operator. For
each of these orientations determine the error of fit.
Let $\phi_0$ be the orientation at which the error is minimized,
and $c_0$ be the height of the fitted ridge there.

3. (Ignore low ridges) If $c_0 < T_2$ (where $T_2$ is another
threshold, the minimum ridge height) ignore this
ridge as insignificant. Otherwise, the evidence suggests
that there is an image edge centered at $(x_0, y_0)$
with half-length L and orientation $\phi_0$. Draw such a
line in the image to mark it.
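
The control flow of the three steps above can be sketched as follows. To keep the sketch short, the ridge fit is passed in as a callable `fit(E, x0, y0, phi)` returning the fitted height and the error of fit; that interface, the function names, the number of trial orientations, and the treatment of plateau pixels as maxima are all assumptions rather than details given in the text.

```python
def local_maxima(E, t1):
    """Step 1: pixels above t1 that no 8-neighbour exceeds (E indexed E[y][x])."""
    h, w = len(E), len(E[0])
    found = []
    for y in range(h):
        for x in range(w):
            v = E[y][x]
            if v <= t1:
                continue
            neighbours = [E[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))
                          if (i, j) != (x, y)]
            if all(v >= n for n in neighbours):
                found.append((x, y))
    return found


def detect_ridges(E, theta, alpha, t1, t2, fit, n_orient=5):
    """Steps 1-3 for one significance array; returns (x0, y0, phi0, c0) edgels.

    Orientations are sampled uniformly in [theta - alpha/2, theta + alpha/2];
    n_orient must be at least 2.
    """
    edgels = []
    for x0, y0 in local_maxima(E, t1):
        best = None
        for k in range(n_orient):                       # step 2
            phi = theta - alpha / 2 + alpha * k / (n_orient - 1)
            c, err = fit(E, x0, y0, phi)
            if best is None or err < best[2]:
                best = (phi, c, err)
        phi0, c0, _ = best
        if c0 >= t2:                                    # step 3
            edgels.append((x0, y0, phi0, c0))
    return edgels


def toy_fit(E, x0, y0, phi):
    """Stand-in fit for demonstration only: error is smallest at phi = 0.3."""
    return E[y0][x0], (phi - 0.3) ** 2


E = [[0.0] * 5 for _ in range(5)]
E[2][2] = 5.0   # a strong response
E[0][4] = 3.0   # a weak response, above T1 but below T2
print(detect_ridges(E, theta=0.2, alpha=0.4, t1=2.0, t2=4.0, fit=toy_fit))
```

In a real run the callable would be a quadratic ridge fit over the rotated window of Figure 8, and the procedure would be repeated for every orientation plane.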

Figure 9 illustrates the working of this algorithm on the
image of Figure 5. Simple intensity edge operators (with
L = 20, W = 10, N = 4) were applied to the image at
every location, at six orientations 10, 40, 70, 100, 130 and
160 degrees, forming the edge significance arrays $E_\theta$. (Note
that neither of the image edges is aligned with any of the
edge detectors used; also, the program had no knowledge of
edge position or orientation.) The edge significance arrays
for θ = 10 and θ = 130 are shown in Figures 9a and 9b,
with bright pixels corresponding to high significance; the
edge significance array for θ = 160 = −20 appears in
Figure 5b. Notice that the ridges follow the image edges.
Figure 9c shows the ridges found by the above algorithm, for
all six orientations. This algorithm does a reasonable job
of localizing the edges. It must be emphasized that any
demonstration that purports to find “the edges” in an image,
for comparison with human perception, necessarily involves
thresholds on edge significance. The thresholds for
the demonstrations in this paper were adjusted manually
so that the detected edges correspond as nearly as possible
with those perceived by humans.

3.2  Handling  corners 

Examination of Figure 9c shows that the detected ridges
extend beyond the corner. The amount of overextension is
typically about 1/3 the length of the ridge detector operator,
but depends on the threshold $T_2$, with more overextension
as $T_2$ decreases. When small operators are used, this
overextension can often be ignored (as in [8]), but when
larger operators are used, overextension becomes more of
an issue, and it is desirable to suppress the ridges that
extend beyond a tangent discontinuity in the edge.

The simple method used here relies on the fact that, at
a corner, a ridge is intersected by a ridge of a different
orientation, forming an “X”-shaped pattern. In most cases,
removing the two shortest subridges radiating from the
intersection is sufficient to suppress the overextended ridges.
Figure 9d, computed from Figure 9c, shows the results of
such a computation; the corner is now more accurately
localized.
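
A minimal sketch of the corner rule just described, under the assumption that each detected ridge is represented as a straight segment between two endpoints. The names `intersect` and `suppress_overextension` are hypothetical, and the sketch handles only a single pair of crossing ridges.

```python
import math


def intersect(seg_a, seg_b):
    """Intersection point of two segments, or None if they do not cross."""
    (px, py), (p2x, p2y) = seg_a
    (qx, qy), (q2x, q2y) = seg_b
    rx, ry = p2x - px, p2y - py
    sx, sy = q2x - qx, q2y - qy
    denom = rx * sy - ry * sx
    if denom == 0:
        return None                      # parallel ridges never form a corner
    t = ((qx - px) * sy - (qy - py) * sx) / denom
    u = ((qx - px) * ry - (qy - py) * rx) / denom
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (px + t * rx, py + t * ry)
    return None


def suppress_overextension(seg_a, seg_b):
    """Keep the two longest of the four subridges radiating from a crossing.

    Returns a list of (corner, endpoint) pairs; if the ridges do not
    cross, both are returned unchanged.
    """
    corner = intersect(seg_a, seg_b)
    if corner is None:
        return [seg_a, seg_b]
    subridges = [(corner, end) for end in (*seg_a, *seg_b)]
    subridges.sort(key=lambda s: math.dist(s[0], s[1]), reverse=True)
    return subridges[:2]


# Two ridges that each overshoot a corner at (10, 0) by two pixels.
horizontal = ((0, 0), (12, 0))
vertical = ((10, -2), (10, 10))
print(suppress_overextension(horizontal, vertical))
```

As the text notes, the rule works in most cases only; if the two shortest subridges happen to belong to the same ridge, that ridge is removed entirely.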

3.3  Demonstration 

The  localization  algorithm  was  applied  to  the  textured  im¬ 
age  shown  in  Figure  10a,  a  foam  square  lying  on  an  image 
of  pigskin  from  Brodatz  [4].  In  the  halftoned  image  re¬ 
produced  in  this  paper  it  is  difficult  to  see  the  shadows  to 
the  left  and  right  of  the  foam  square  (due  to  multiple  light 
sources)  and  the  variations  in  texture  within  the  foam  it¬ 
self.  These  aspects  add  to  the  difficulty  of  segmenting  the 
image. 

To summarize the segmentation process, the texture features
described in [11] were computed over the upper portion
of the image. The overall significance of texture edges
was computed with edge operators of length L = 25, width
W = 12; seven orientations (10, 36, 62, 88, 114, 140, 166
degrees) were used, providing the edge significance arrays
$E_\theta$. Figure 10b shows the results of the edge localization
algorithm after corner suppression.

The  orientation  of  the  left  and  right  edges  of  the  square 
is  about  78°;  of  the  top,  about  168°.  The  top  edge  is  eas¬ 
ily  detected  since  there  is  a  sharp  texture  boundary,  no 
shadow,  and  the  orientation  of  the  edge  and  the  edge  oper¬ 
ator  at  166°  differ  by  only  2°.  The  orientations  of  the  left 
and  right  edges  fall  midway  between  the  available  edge  de¬ 
tector  orientations  (62  and  88  degrees),  making  them  more 
difficult  to  detect.  In  addition,  the  shadow  edges  are  within 
one  operator-width  of  the  foam/background  edge,  which  de¬ 
creases  the  edge  significance  for  the  main  edge.  As  a  result, 
the  left  and  right  edges  were  not  as  well  localized  as  the  top 
edge.  The  texture  edge  operators  are  quite  sensitive  to  tex¬ 
ture  differences,  and  detected  some  edges  within  the  foam 
square  and  near  its  boundaries  that  are  not  easily  visible  in 
the  halftoned  image  reproduced  here.  In  spite  of  these  dif¬ 
ficulties,  most  of  the  dominant  edges  bounding  the  square 
were  detected  and  localized. 

4  Summary  and  Conclusions 

This paper discussed some aspects of the problem of edge
localization, and proposed a simple implementation of the
ideas. We began by examining the localization problem
for boundaries of very simple, abstract textures. We saw
that even in this simple situation the localization problem
is quite difficult. There is no direct way to estimate the
boundary location even when its orientation is known; the
location must be determined iteratively by finding the position
that maximizes the significance of the difference between
the textures on either side. There is generally no
unique position that maximizes the significance; the edge
can only be localized to fall within a region whose size
depends on the density of the texture elements. In general, the
number of boundaries and their position and orientation are
unknown, and we encounter difficult issues such as the scale
problem. Localizing edges requires evaluating edge operators
at many positions and orientations, and interpreting
these edge significances to obtain a coherent description of
the underlying edges.

Figure 9: (a) Edge significance array for edges in Figure 5,
θ = 10. (b) Edge significance array for θ = 130. (c) Detected
ridges; thresholds $T_1$ = 2.5, $T_2$ = 4.1. The image is removed
for clarity; the thin lines represent the underlying edge.
(d) Results after removing overextended portions of ridges.

A simple edge localization algorithm was proposed. It
takes as input a number of edge significance planes $E_\theta$ which
contain at each point (x, y) the significance of an edge
centered at (x, y) in direction θ. Image edges are manifested as
ridges in the edge significance arrays, with the orientation of
the ridge matching the orientation of the image edge. The
localization algorithm interprets these significance arrays,
searching for patterns that correspond to image edges; only
directions near θ need to be examined for ridges. These
edge operators respond strongly when centered laterally on
an edge, even when they extend beyond the termination
of the edge. A simple method was discussed that removes
many of these overextended edges.

The  algorithm  discussed  here  leaves  many  issues  unre¬ 
solved.  The  problem  of  measuring  curve  simplicity  in  a 
well-motivated  way  remains  open.  The  scale  problem  is  par¬ 
ticularly  difficult  but  has  received  little  attention  to  date. 
We have not discussed linking of the detected edgels; for a
discussion of that subject see [1,8].

Abstract

A  new  method  for  determining  local  surface  orientation 
from  the  autocorrelation  function  of  statistically  isotropic 
textures  is  introduced.  It  relies  on  the  foreshortening  that 
occurs  in  the  image  of  an  oriented  surface,  and  the  anal¬ 
ogous  foreshortening  produced  in  the  texture  autocorre¬ 
lation  function.  This  method  assumes  textural  isotropy, 
but  does  not  require  the  texture  to  be  composed  of  texels 
or  assume  texture  regularities  such  as  equal  area  texels  or 
equal  spacings  between  texels.  This  technique  is  applied 
to  natural  images  of  planar  textured  surfaces  and  found 
to  give  good  results  in  many  instances.  The  simplicity  of 
the  method  and  its  use  of  information  from  all  parts  of  the 
image  are  emphasized. 


Introduction 

An important task that arises in many computer vision systems
is the reconstruction of three-dimensional depth information
from two-dimensional images. Of the many potential depth cues
discussed in the literature, texture might be expected to play a
particularly central role in the processing of certain classes of
images such as those of natural outdoor scenes. Indeed, a variety
of shape-from-texture algorithms have been proposed [1].

The  determination  of  surface  orientation  from  textural  cues 
has  been  based  on  two  general  techniques.  Gradient  methods, 
exemplified  by  the  work  of  Gibson[2],  rely  on  changes  in  texture 
properties  such  as  the  density  of  texels  as  the  surface  recedes  from 
the  observer.  It  is  also  possible,  however,  to  deduce  surface  orien¬ 
tation  from  purely  local  properties  of  the  observed  texture.  Our 
work  follows  this  latter  approach,  and  bears  many  similarities  to 
the  algorithm  presented  in  the  pioneering  paper  of  Witkin[3]. 

Witkin begins with the assumption that the texture is isotropic,
that is, that statistically speaking the texture has no inherent
directionality. The distribution of edge directions obtained from
such a texture will therefore be flat. If, however, the textured
surface is viewed obliquely, projective foreshortening (a purely
local phenomenon) distorts the distribution in a well-defined
way. Witkin thus proposes that a histogram of edge directions
constructed from the image in question be used to determine
surface orientation via a maximum likelihood fit.

‘This  work  was  supported  in  part  by  DARPA  grant  #  N00039-S4-C-0J65. 
One of us (H.S.) held a Weizmann fellowship during the course of this work.


The  method  proposed  here  is  also  based  on  the  assumption 
of  textural  isotropy,  but  uses  the  projective  distortion  of  the  tex¬ 
ture  autocorrelation  function  as  an  orientation  cue  rather  than 
the  projective  distortion  of  Witkin’s  edge-direction  histogram. 
The  autocorrelation  of  an  oriented  texture  is  foreshortened  in  a 
way  identical  to  the  foreshortening  of  the  image  itself,  and  the 
amount  and  direction  of  the  foreshortening  are  conveniently  mea¬ 
sured  by  the  moments  of  the  autocorrelation  function.  Potential 
advantages  of  this  approach  are  the  elimination  of  the  arbitrari¬ 
ness  associated  with  the  choice  of  edge-detection  algorithm  and 
the  fact  that  it  uses  information  from  all  parts  of  the  texture 
rather  than  just  the  edges. 

The next section contains a quantitative treatment of projective
foreshortening of texture autocorrelation, and culminates in
a formula expressing surface orientation as a function of the
moments of the texture autocorrelation. The third section provides
the practical details of an algorithm based on this approach, and
describes the results it yielded when applied to textures found in
common outdoor scenes.

Foreshortening  of  Texture  Autocorrelation 

Let us begin with the terminology to be used in this and the
subsequent section. The orientation of a surface (with respect to
the line of sight) will be given in terms of the slant and tilt
parameters, (σ, τ), introduced by Witkin [3]. The slant of a surface,
σ, is defined as the angle between the normal to the surface and
the line of sight. That is, σ is the amount the surface slants away
from being parallel to the image plane. We will take σ to run
between 0° and 90°. The tilt, τ, specifies the direction in which the
surface is slanted, and is defined as the angle between the x-axis
of the image plane and the projection into the image plane of the
normal to the surface. We will take τ to run between −180° and
180°. For example, a surface parallel to the image plane has zero
slant and an undefined tilt.

Let an image be specified by a gray-scale function $F(\mathbf{r})$,
where $\mathbf{r} = (r_1, r_2)$ denotes a point in the image plane. The image
of a textured plane that is viewed head on, i.e., that is
perpendicular to the line of sight, will be denoted by $F_\perp(\mathbf{r})$.
$F_{(\sigma,\tau)}(\mathbf{r})$ is then defined as the image produced by the same
plane when it is given orientation (σ, τ) with respect to the line
of sight.

The autocorrelation of an image, $A(\mathbf{r})$, is defined as

$$A(\mathbf{r}) = \int \left(F(\mathbf{r}') - \bar{F}\right)\left(F(\mathbf{r}' + \mathbf{r}) - \bar{F}\right) d\mathbf{r}', \tag{1}$$

where $\bar{F}$ is the mean of F. The autocorrelation is conventionally
normalized such that A(0) = 1, but our purposes do not
require a specific choice of normalization to be made. The
head-on autocorrelation, $A_\perp$, and the oriented autocorrelation,
$A_{(\sigma,\tau)}$, correspond to the images $F_\perp$ and $F_{(\sigma,\tau)}$, respectively.

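Finally, we introduce the autocorrelation moment matrix,¹
defined as

$$\mu = \begin{pmatrix} \mu_{11} & \mu_{12} \\ \mu_{21} & \mu_{22} \end{pmatrix}, \qquad \mu_{ij} = \int r_i\, r_j\, A(\mathbf{r})\, d\mathbf{r}. \tag{2}$$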

The moment matrix is trivially symmetric: $\mu_{12} = \mu_{21}$. Since
the region of integration in equation (2) is notionally infinite
(although it corresponds in practice to a finite sum over pixels) we
require $A(\mathbf{r})$ to fall off sufficiently rapidly at large distance that
the moments be well-defined. The randomness found in natural
textures suffices to produce the desired behavior. A highly regular
pattern, however, can have an autocorrelation that remains
large at large distance; the algorithms presented here cannot be
applied directly in such a situation. The autocorrelation moment
matrices derived from the image of a textured plane viewed head
on and the same plane with orientation (σ, τ) will be denoted
$\mu_\perp$ and $\mu_{(\sigma,\tau)}$, respectively.
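
Equations (1) and (2) can be sketched as finite sums over pixels. This is an illustrative sketch only: the lag window `max_lag`, the restriction of each sum to the region where both pixels lie inside the image, and the function names are assumptions, and no normalization is applied.

```python
import random


def autocorrelation(F, max_lag):
    """Equation (1) as a finite sum; returns a dict {(dx, dy): A}.

    F is a gray-scale image indexed F[y][x]. Only pixel pairs that both
    fall inside the image contribute (an assumed boundary treatment).
    """
    h, w = len(F), len(F[0])
    mean = sum(map(sum, F)) / (h * w)
    G = [[v - mean for v in row] for row in F]
    A = {}
    for dy in range(-max_lag, max_lag + 1):
        for dx in range(-max_lag, max_lag + 1):
            total = 0.0
            for y in range(max(0, -dy), min(h, h - dy)):
                for x in range(max(0, -dx), min(w, w - dx)):
                    total += G[y][x] * G[y + dy][x + dx]
            A[(dx, dy)] = total
    return A


def moment_matrix(A):
    """Equation (2): mu_ij = sum over lags r of r_i * r_j * A(r)."""
    mu = [[0.0, 0.0], [0.0, 0.0]]
    for r, a in A.items():
        for i in range(2):
            for j in range(2):
                mu[i][j] += r[i] * r[j] * a
    return mu


# A small random texture, used only to exercise the two functions.
rng = random.Random(1)
F = [[rng.random() for _ in range(16)] for _ in range(16)]
A = autocorrelation(F, max_lag=4)
mu = moment_matrix(A)
print(mu)
```

The lag window plays the role of the requirement that A(r) fall off at large distance: for a highly regular pattern the result would depend strongly on `max_lag`.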

We  now  discuss  how  projective  distortion  affects  the  autocor¬ 
relation  moment  matrix  of  an  image,  and  how  this  information 
can  be  used  to  determine  the  orientation  of  a  textured  plane.  Let 
F±(r)  be  the  head-on  image  of  a  textured  plane.  Now  consider 
the  image  produced  by  the  same  planar  surface  when 
viewed  obliquely  with  slant  and  tilt  parameters  (<7,r).  We  make 
the  simplifying  assumption  that  the  distance  between  the  plane 
and  observer  is  very  large  in  comparison  with  the  linear  size  of 
the  portion  of  the  plane  under  consideration.  The  image  of  the 
oriented  plane  is  therefore  given  by  orthographic  projection: 

F^T)(r )  =  Fi(M-V,r)r),  (3) 


<7  -  1 )  cos2  t 


)cos  r  sin  r  1  +  (cos  <7 


(cos <7—  1 ) cosr  sin  r\ 

1  +  (cos  <7  -  1)  sin2  r  )  ' 


Notice  that 


M(<7,  r)  =  RrM(cr,0)R-r,  (5a) 

with 

RT=(COSr  (56) 

V  sin  t  cos  t  ) 

and 

m K°)=(cy  ;).  (50 

That is, $M(\sigma,\tau)$ is the matrix which foreshortens the vector
$\mathbf{r}$ by a factor of $\cos\sigma$ along the direction of tilt,
$\tau$, while leaving the direction perpendicular to $\tau$ unchanged.
We should emphasize that $M(\sigma, \tau + 180°)$ (or equivalently
$M(-\sigma, \tau)$) and $M(\sigma,\tau)$ are equal. It is clear,
therefore, that our method will yield surface orientation only up to an
overall $\tau \to \tau + 180°$ ambiguity.
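The properties just stated can be checked numerically. The following is a minimal sketch, not part of the original method: it builds $M(\sigma,\tau)$ entry by entry from equation (4), compares it with the factored form of equations (5a)-(5c), and confirms the determinant and the 180° ambiguity. The function names and the sample angles are assumptions made for illustration.

```python
import math

def rotation(tau):
    """2x2 rotation matrix R_tau of equation (5b), as nested tuples."""
    c, s = math.cos(tau), math.sin(tau)
    return ((c, -s), (s, c))

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def foreshortening(sigma, tau):
    """M(sigma, tau) written out entry by entry as in equation (4)."""
    k = math.cos(sigma) - 1.0
    c, s = math.cos(tau), math.sin(tau)
    return ((1.0 + k * c * c, k * c * s),
            (k * c * s, 1.0 + k * s * s))

def foreshortening_factored(sigma, tau):
    """M(sigma, tau) = R_tau M(sigma, 0) R_{-tau}, equations (5a)-(5c)."""
    m0 = ((math.cos(sigma), 0.0), (0.0, 1.0))
    return matmul(matmul(rotation(tau), m0), rotation(-tau))

def close(a, b, tol=1e-12):
    """True when two 2x2 matrices agree entry by entry within tol."""
    return all(abs(a[i][j] - b[i][j]) < tol
               for i in range(2) for j in range(2))

# Sample orientation (an arbitrary choice for this sketch).
sigma, tau = math.radians(58.0), math.radians(-16.0)
M = foreshortening(sigma, tau)
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
```

Because $M$ depends on $\tau$ only through $\cos^2\tau$, $\sin^2\tau$ and $\cos\tau\sin\tau$, adding 180° to the tilt leaves every entry unchanged, which is the ambiguity noted above.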

The  image  autocorrelation  transforms  identically  to  the  image 
under  orthographic  projection: 

$$A_{(\sigma,\tau)}(\mathbf{r}) = A_\perp\bigl(M^{-1}(\sigma,\tau)\,\mathbf{r}\bigr). \qquad (6)$$

¹The use of moments in vision processing has been discussed by Hu in
reference [4]. We note that our notations differ; the correspondence is
given by $(\mu_{11},\ \mu_{12} = \mu_{21},\ \mu_{22}) \leftrightarrow (m_{20},\ m_{11},\ m_{02})$.


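To relate $\mu_{(\sigma,\tau)}$ to $\mu_\perp$ we evaluate the integral
in equation (2) via the change of variables
$\mathbf{r} \to \mathbf{r}' = M(\sigma,\tau)\,\mathbf{r}$, and (using
the result $\det M(\sigma,\tau) = \cos\sigma$) obtain

$$\mu_{(\sigma,\tau)} = \cos\sigma\; M(\sigma,\tau)\, \mu_\perp\, M(\sigma,\tau).$$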
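For readers who want the intermediate steps, the change of variables can be written out componentwise. This is a sketch under the stated assumptions: the integration variable $\mathbf{s}$ is a name introduced here, and $M$ stands for $M(\sigma,\tau)$.

```latex
\begin{align*}
\bigl(\mu_{(\sigma,\tau)}\bigr)_{ij}
  &= \int r_i\, r_j\, A_\perp\!\bigl(M^{-1}\mathbf{r}\bigr)\, d\mathbf{r}
     && \text{equations (2) and (6)} \\
  &= \cos\sigma \int (M\mathbf{s})_i\, (M\mathbf{s})_j\,
     A_\perp(\mathbf{s})\, d\mathbf{s}
     && \mathbf{r} = M\mathbf{s},\quad d\mathbf{r} = \det M\, d\mathbf{s} \\
  &= \cos\sigma \sum_{k,l} M_{ik}\, M_{jl}\, (\mu_\perp)_{kl}
   = \cos\sigma\, \bigl(M\, \mu_\perp\, M^{\mathsf{T}}\bigr)_{ij}.
\end{align*}
```

Since $M$ is symmetric by equation (4), $M^{\mathsf{T}} = M$, which gives the form quoted in the text.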

In order to extract the slant and tilt parameters from $\mu$ some a
priori information about the texture is required. In this work we will
assume that, in a statistical sense, the texture is isotropic,² and
hence that the moment matrix is a multiple of the identity,
$\mu_\perp = \xi\,\mathbf{1}$. Given this assumption, we now have

$$\mu_{(\sigma,\tau)} = c\, M^2(\sigma,\tau), \qquad (7)$$

where the constant is given by $c = \xi \cos\sigma$.

Equation (7), which relates the autocorrelation moment matrix of the
image of a (directionally homogeneous) textured surface to the slant
and tilt parameters that specify its orientation, forms the basis of
our shape-from-autocorrelation algorithm. We complete our discussion of
the underpinnings of this algorithm by giving the slant and tilt
parameters as explicit functions of the matrix $\mu_{(\sigma,\tau)}$.
From equation (5c), $\cos^2\sigma$ is given by the ratio of the smaller
to the larger eigenvalue of $\mu_{(\sigma,\tau)} \propto M^2(\sigma,\tau)$:

$$\sigma = \arccos\sqrt{\frac{\mu_{11} + \mu_{22} - \sqrt{(\mu_{11} - \mu_{22})^2 + 4\mu_{12}^2}}{\mu_{11} + \mu_{22} + \sqrt{(\mu_{11} - \mu_{22})^2 + 4\mu_{12}^2}}}. \qquad (8a)$$


The expression for the tilt parameter, $\tau$, follows from equation
(4):

$$\tau = \frac{1}{2}\arctan\frac{2\mu_{12}}{\mu_{11} - \mu_{22}}. \qquad (8b)$$

To summarize: equation (8), our principal result, gives the
orientation of a textured surface as a function of its autocorrelation
moment matrix, provided

• textural isotropy is assumed,

• the autocorrelation falls off sufficiently rapidly with distance,

• orthographic projection is assumed, and

• additional information is available to resolve the
$\tau \leftrightarrow \tau + 180°$ ambiguity.
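Equation (8) can be exercised with a short round-trip check. The sketch below is illustrative only: it synthesizes a moment matrix from equation (7) for a known orientation and then recovers the slant and tilt. One detail is an assumption of this sketch rather than something stated in the text: the tilt is computed with a two-argument arctangent, with signs chosen so that the returned angle points along the foreshortened direction, whereas a plain arctangent alone determines the tilt only up to 90°. All function names are hypothetical.

```python
import math

def moment_matrix(sigma, tau, xi=1.0):
    """(mu11, mu12, mu22) predicted by equation (7): xi*cos(sigma)*M^2."""
    # M^2 = 1 + (cos^2(sigma) - 1) u u^T, with u the unit tilt direction.
    k = math.cos(sigma) ** 2 - 1.0
    c, s = math.cos(tau), math.sin(tau)
    scale = xi * math.cos(sigma)
    return (scale * (1.0 + k * c * c),
            scale * k * c * s,
            scale * (1.0 + k * s * s))

def slant_tilt(mu11, mu12, mu22):
    """Recover (sigma, tau) in radians from a symmetric moment matrix."""
    root = math.hypot(mu11 - mu22, 2.0 * mu12)
    # Ratio of the smaller to the larger eigenvalue, clamped for safety.
    ratio = (mu11 + mu22 - root) / (mu11 + mu22 + root)
    ratio = min(1.0, max(0.0, ratio))
    sigma = math.acos(math.sqrt(ratio))            # equation (8a)
    # Equation (8b) in atan2 form (assumption of this sketch): the smaller
    # eigenvalue lies along the tilt, so both arguments carry a minus sign.
    tau = 0.5 * math.atan2(-2.0 * mu12, mu22 - mu11)
    return sigma, tau

# Round trip for two sample orientations, in degrees.
s1, t1 = slant_tilt(*moment_matrix(math.radians(58.0), math.radians(-16.0), xi=3.0))
s2, t2 = slant_tilt(*moment_matrix(math.radians(71.0), math.radians(23.0)))
```

The overall scale `xi` drops out of both formulas, so only the shape of the moment matrix matters, and the recovered tilt is still defined only modulo 180°.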


Implementation  and  Results 


To test this method, images were made of textures that seemed
reasonably isotropic. In each case a single planar surface was
photographed from a sufficient distance that orthographic projection
was a good approximation, and so that the entire image showed a surface
with a single orientation. We used textures commonly found in natural
outdoor scenes, such as grass, rocks, dirt and leaves. Because of their
practical importance in navigation, images of roads, sidewalks, and
pebbled pathways were also included.

²It should be clear that this assumption need not be strictly
satisfied. It suffices to make the weaker assumption that
$\mu_\perp = \xi\,\mathbf{1}$, that is, that any directional
inhomogeneities the texture may have are not of the sort that mimic an
oblique viewing angle, and hence do not show up in the moment matrix.
The following simple scheme permitted the actual orientation to be
conveniently determined. Two photographs of each surface were taken,
identical except that in one of the pictures a flat circular object was
placed on the surface. The picture without the object was used as input
to our shape-from-autocorrelation algorithm, while measurements of the
foreshortened image of the circular object in the second picture gave
the true surface orientation.

The photographs were digitized to yield 256 x 256 8-bit grayscale
images, and the autocorrelation was computed as the Fourier transform
of the power spectrum of the image. Since the Fourier transform can be
computed with O(n log n) cost, where n is the number of pixels in the
image, and the moments computed with O(n) cost, the entire procedure
has a complexity of O(n log n). Thus, for a small cost above the
optimal, the method uses all the information in the image.
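The relation between the autocorrelation and the power spectrum can be illustrated on a tiny image. The sketch below is not the implementation used for the experiments: it uses a naive discrete Fourier transform in place of a fast one, so it only demonstrates the identity, not the O(n log n) cost. It also assumes wrap-around (circular) shifts, which the text does not specify. All names are hypothetical.

```python
import cmath

def dft2(img, inverse=False):
    """Naive 2-D discrete Fourier transform of a square list of lists."""
    n = len(img)
    sign = 1 if inverse else -1
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for x in range(n):
                for y in range(n):
                    angle = sign * 2.0 * cmath.pi * (u * x + v * y) / n
                    total += img[x][y] * cmath.exp(1j * angle)
            # The inverse transform carries the 1/n^2 normalization.
            out[u][v] = total / (n * n) if inverse else total
    return out

def autocorr_power_spectrum(img):
    """Autocorrelation as the (inverse) transform of the power spectrum."""
    spectrum = dft2(img)
    power = [[abs(z) ** 2 for z in row] for row in spectrum]
    return [[z.real for z in row] for row in dft2(power, inverse=True)]

def autocorr_direct(img):
    """Circular autocorrelation computed directly from its definition."""
    n = len(img)
    return [[sum(img[x][y] * img[(x + dx) % n][(y + dy) % n]
                 for x in range(n) for y in range(n))
             for dy in range(n)] for dx in range(n)]

# A small made-up test image.
img = [[1, 2, 0, 1],
       [0, 3, 1, 2],
       [2, 1, 1, 0],
       [1, 0, 2, 3]]
via_spectrum = autocorr_power_spectrum(img)
direct = autocorr_direct(img)
```

In practice the two naive transforms would be replaced by a fast Fourier transform routine, which is where the O(n log n) bound quoted above comes from.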

When computing the moments the integration in equation (2) is replaced
by a summation, with the autocorrelation nominally being summed over
all possible separations. To the extent that the autocorrelation falls
off rapidly, the result should be insensitive to the precise region of
summation. However, realistic images do not obey this assumption in a
robust way, and statistical noise in the autocorrelation at large
separations can cause excessive error in the values of the moments. To
alleviate this problem we sum only over autocorrelations which exceed
some threshold value, effectively restricting the moment sum to
relatively small separations. The threshold value is chosen to be the
value of the autocorrelation averaged over a ring with an empirically
determined radius. In our experiments, a radius of 10 pixels was found
to give good results. Figure 1 shows how this radius affects the slant
and tilt estimates for four typical images (D, H, I, and K). For each
image the 'O' marks the radius value at which the computed
$(\sigma, \tau)$ matches the actual value. As seen from these figures
the threshold radius must be chosen intelligently, but the choice is
not particularly delicate: any threshold radius in the range ~5 to 25
would yield similar results.
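The thresholding step can be sketched as follows. This is one possible reading of the description above, not the authors' code: the ring is taken to be all separations within half a pixel of the chosen radius, and the autocorrelation is assumed to be stored in a square array with zero separation at the center. The test data is a synthetic Gaussian autocorrelation foreshortened along the x axis (tilt 0), also an assumption made purely for illustration.

```python
import math

def ring_mean(acorr, radius, width=0.5):
    """Mean autocorrelation over separations within `width` of `radius`."""
    h = len(acorr) // 2          # array is (2h+1) x (2h+1), zero shift at [h][h]
    vals = [acorr[h + dy][h + dx]
            for dy in range(-h, h + 1) for dx in range(-h, h + 1)
            if abs(math.hypot(dx, dy) - radius) <= width]
    return sum(vals) / len(vals)

def thresholded_moments(acorr, radius):
    """Sum r_i r_j A(r) only where A exceeds the ring-mean threshold."""
    h = len(acorr) // 2
    threshold = ring_mean(acorr, radius)
    m11 = m12 = m22 = 0.0
    for dy in range(-h, h + 1):
        for dx in range(-h, h + 1):
            a = acorr[h + dy][h + dx]
            if a > threshold:
                m11 += dx * dx * a
                m12 += dx * dy * a
                m22 += dy * dy * a
    return m11, m12, m22

# Synthetic autocorrelation: isotropic Gaussian squeezed by cos(slant) in x.
h, spread, true_slant = 30, 6.0, math.radians(60.0)
c = math.cos(true_slant)
acorr = [[math.exp(-((dx / c) ** 2 + dy ** 2) / (2.0 * spread ** 2))
          for dx in range(-h, h + 1)] for dy in range(-h, h + 1)]

m11, m12, m22 = thresholded_moments(acorr, radius=10)
# With tilt 0 the eigenvalues are m11 and m22, so (8a) reduces to this ratio.
slant_deg = math.degrees(math.acos(math.sqrt(m11 / m22)))
```

Because the region above any threshold is a level set of the autocorrelation, it has the same foreshortened shape as the autocorrelation itself, which is why the eigenvalue ratio is largely insensitive to the exact threshold.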

A summary of the results is shown in Figure 2. In this plot each
orientation is represented by a point in polar coordinates where the
slant is the distance from the center and the tilt is the angle. For
each image (A-K) the error between the actual measurement of the
surface orientation and the value predicted by the method is shown.
Considering the difficulty in discerning orientations of homogeneously
textured surfaces without perspective or world-knowledge cues, all of
the results except images F and D were very reasonable.

Figure 3 illustrates the technique and its behavior in more detail for
the four representative images. The large picture in the upper left
hand corner of each of the four examples shows a 512 x 512 color image
of the textured surface with the flat circular object lying on it. As
was just discussed, the actual surface orientation was determined from
the foreshortening of the circular object in this image. The picture in
the upper right hand corner shows the 256 x 256 black and white
subimage of the larger image without the circular object. It is the
autocorrelation of this image that was used to compute an estimate of
the surface orientation. The actual and computed orientations are given
at the bottom of each figure. The picture in the lower right corner
shows the central 61 x 61 region of the autocorrelation. The predicted
foreshortening of the autocorrelation is clearly seen, and the elliptic
shape of the foreshortened autocorrelation corresponds closely to the
foreshortened image of the circular object placed on the surface. It
should be observed that, although most people find it virtually
impossible to perceive the surface orientation from the pictures used
for the computation, the method is nevertheless often successful. Even
in the road example, where the predicted orientation is poor, the
autocorrelation shows the expected properties.

Concluding  Remarks 

We have developed a new shape-from-texture algorithm that uses the
effects of foreshortening on the autocorrelation of an image. The
method assumes textural isotropy, that is, that the statistical
properties of the texture have no inherent directionality, and that the
texture is orthographically projected onto the image plane. The
performance of this algorithm is quite good; the empirical tests
presented in the previous section show that this algorithm gives good
estimates of surface orientation when applied to a variety of natural
textures. This method is simple and well-defined, and avoids, for
example, assumptions concerning texels and the arbitrariness associated
with edge-detection. A particular advantage of this approach is its use
of information from all parts of the image; all pixels contribute to
the final orientation estimate.

Future  extensions  of  this  method  might  include  weakening  the 
assumption  of  textural  isotropy  through  the  use  of  multiple  views, 
techniques  involving  gradients  of  the  texture  autocorrelation,  or 
use  of  a  knowledge  system  to  provide  a  priori  information  about 
the  head-on  texture  autocorrelation.  Surface  curvature  should  be 
measurable  by  applying  the  method  patch-wise  over  an  image. 
A  parallel  implementation  using  image  shifting  to  compute  the 
relevant  portion  of  the  autocorrelation  should  lead  to  substantial 
increases  in  the  speed  of  the  algorithm. 

Acknowledgments 

We would like to thank John Kender for his support and encouragement.
We would also like to thank Monnett Hanvey for the use of her
photographic equipment and expertise.

References 

[1] J. Aloimonos and M.J. Swain, “Shape from Texture,” Proc. of the
Tenth International Joint Conf. on A.I., IJCAI (1985).

R. Bajcsy and L. Lieberman, “Texture Gradient as a Depth Cue,”
Computer Graphics and Image Processing 5 (1976), 52-67.

K. Ikeuchi, “Shape from Regular Patterns (an Example from Constraint
Propagation in Vision),” Proc. of the International Conf. on Pattern
Recognition (1980), 1032-1039.

J.R. Kender, Shape from Texture, PhD thesis, CMU, 1980.

Y. Ohta, K. Maenobu, and T. Sakai, “Obtaining Surface Orientation from
Texels under Perspective Projection,” Proc. of the Seventh
International Joint Conf. on A.I., IJCAI (1981).

[2] J. Gibson, The Perception of the Visual World. Cambridge, MA:
Riverside Press, 1950.

[3] A. P. Witkin, “Recovering Surface Shape and Orientation from
Texture,” Artificial Intelligence 17 (1981), 17-45.

[4] M. K. Hu, “Visual Pattern Recognition by Moment Invariants,” IRE
Trans. Inform. Theory IT-8 (Feb. 1962), 179-187.


[Figure 1 plots not reproduced; horizontal axis: Radius, 0 to 100.]

Figure 1. Effects of Thresholding Radius on the Elimination of
Autocorrelation Noise for Four Typical Images. The small circle marks
the radius value at which the computed slant or tilt matches the actual
value.


Figure  2.  Polar  Representation  of  Error  Between  Actual  and 
Computed  Surface  Orientation  for  Eleven  Pictures.  The  slant  is 
the  distance  from  the  center  and  the  tilt  is  the  angle. 


Actual  slant=71,  tilt=23 
Computed  slant=58,  tilt=25 


Actual  slant=58,  tilt=-16 
Computed  slant=53,  tilt=-17 


Figure 3. Results of Autocorrelation Method. The large picture in the left hand corner of
each of the four figures shows a 512 x 512 color image of the textured surface with the flat
circular object lying on it, from which the actual surface orientation was determined.
The picture in the upper right hand corner shows the 256 x 256 black and white subimage
of the larger image without the circular object, which was used to compute an estimate of
the surface orientation. The picture in the lower right corner shows the central 64 x 64
region of the autocorrelation.
Abstract

Recognizing  that  a  polyhedral  image  is  a  segmentation  of  the 
plane  into  connected  regions,  called  faces,  joined  by  vertices 
and  edges,  the  input  image  is  represented  as  a  planar  network 
of  nodes  (vertices,  edges,  faces)  linked  by  adjacency  links.  By 
introducing  junction-pairs  and  junction-loops  as  labelings  of 
edges  and  faces,  globally  consistent  labelings  of  the  whole  image 
can  be  found  through  achieving  node  and  arc  consistencies  at 
all  nodes  and  links  in  the  network.  This  proves  that  labeling 
of  polyhedral  images  and  line  drawings  is  linear  in  the  number 
of  edges  in  the  image,  and  exponential  only  in  the  maximum 
number  of  edges  intersecting  at  a  node.  Test  examples  on  ideal 
and  real  polyhedral  images  are  illustrated. 


1.  Image  Segmentation 

The hidden line view of a finite 3D polyhedral scene is a finite 2D
image. If the hidden line view is completely inside the image boundary,
then the closed object faces project into zero or many closed image
faces. The other faces of the image must come from the background of
the 3D scene. Only the face surrounding the hidden line view is open
and needs to be bounded by the image boundary. All the other faces are
closed by the edges in their boundaries. If the image frame overlaps
the hidden line view, then some edges and faces of the image are cut
off and bounded by the image boundary. The segments and end points that
are on this boundary cannot be considered as valid junctions because
the faces around them are not completely defined. So, we must exclude
the image boundary from labeling and segmentation.

Faces, edges, and vertices are connected sets in $R^2$. If we exclude
the image boundary, then we can define the interior component of a set
$S$ as the set of points of $S$ not on the boundary $\partial S$ and
not on the image boundary $\partial I$. For example, the interior of a
vertex inside $\partial I$ is a point. The interior of an edge (resp.
face) is an open and connected segment (resp. region). We note that no
point of the image is in the interior of two distinct topological sets,
or formally:

Theorem 1 The disjoint union of the interior components of vertices,
edges, and faces of the image is equal to the open set inside the image
boundary.


The image is segmented into subsets of $R^2$, which have adjacent but
non-overlapping interior components. This segmentation can be
represented by a planar network of linked nodes. The nodes are the
vertices, edges and faces, or topological subsets of the image. The
links describe the adjacency relationship between these topological
sets.

2.  Node  Consistency 

Consider the following scenario. We cut n distinct pictures with one
single template to create n puzzles and pile them on top of each other.
The template cuts a picture into m pieces, so there are m piles. Each
pile has n pieces, all with the same shape but with different picture
segments on them. Suppose we shuffle the n pieces within each pile,
then we remove at random a few pieces from each pile. You are now asked
to find all complete pictures from the m piles. Waltz (1975) showed
that the most efficient way to tackle this task is to compare pieces
from one pile against pieces from its adjacent pile, and prune away all
pieces that cannot possibly match. You stop when there is one pile that
is empty, in which case there is no complete picture, or when there is
no further pruning, in which case you have found all the complete
pictures. Each complete picture can be recovered by picking the first
piece from any starting pile, then finding the adjacent pieces that
match the first piece from each pile adjacent to the first pile, and so
on, until we cover all the pieces in the template.

Taking the hint from the above scenario, our task is to define local
labelings at each node of the image, and prove that global consistency
can be achieved from node and arc consistencies at all nodes and links
of the image. Huffman (1971) and Clowes (1971) first introduced
junctions as local labelings at vertices. Represented as a pair or
triple of edge-labels, a junction completely describes the local volume
at a trihedral vertex (L, Fork, Arrow) or an occlusion (T). The local
labeling at an edge $E$ must completely describe the local volume
around edge $E$ up to its two end vertices and its two adjacent faces.
The obvious choice is a set of edge-labels for edge $E$ and all the
edges intersecting $E$. To facilitate arc consistency between an edge
and its end vertices, we choose to represent the local labeling at an
edge by a junction-pair. Similarly, the local labeling at a face $F$ is
a junction-loop, which contains edge-labels for all the boundary edges
in $\partial F$ and all the edges intersecting the face boundary
$\partial F$.
Figure 1: Label hierarchies for vertices, edges, and faces.

Figure 1 shows the label hierarchies for the vertices, edges, and
faces. Figure 2 shows examples of a junction, junction-pair, and
junction-loop, which completely describe the local volume at a vertex,
edge, and face. The junctions for occlusion and trihedral vertices
(Huffman 1971, Clowes 1971) are precomputed and stored in a dictionary.
Junctions and vertex-labels are related by a $\subseteq$ relation,
which describes node consistency at vertices. The junctions for
multihedral vertices with zero or one hidden face can be enumerated
depth first from the Fundamental Junction Equation relating the
orientation of the edges, the ordering of the faces, and their
respective gradient points around a vertex (Malik 1985).

The junction-loops of a face $F$ are found by a depth-first search for
all consistent labelings of the vertices around the face boundary
$\partial F$. Consistency means that the junctions of any two
consecutive vertices must have the same edge-label for the edge
connecting these two vertices. Each junction-loop found from
depth-first search must be checked to see if the edge-labels of the
boundary edges are all consistent with some face-label of $F$. This
latter check is node-consistency enforced by the $\subseteq$ relation,
relating junction-loops and face-labels. The enumeration of the
junction-pairs also requires a depth-first search for all pairs of
junctions that have the same edge-label for the common edge.
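The depth-first enumeration of junction-loops can be sketched in a few lines. This is a simplified illustration rather than the system's implementation: each junction is reduced to the pair of labels it assigns to the two boundary edges meeting at its vertex, the labels themselves are placeholder strings, and the final check against face-labels is omitted. The function name and data layout are assumptions.

```python
def junction_loops(candidates):
    """Enumerate closed chains of junctions around one face boundary.

    candidates[i] lists the junctions allowed at the i-th boundary vertex.
    Each junction is a pair (label_in, label_out): the labels it gives to
    the boundary edge arriving at the vertex and to the one leaving it.
    """
    n = len(candidates)
    if n == 0:
        return []
    loops = []

    def extend(chain):
        i = len(chain)
        if i == n:
            # Closing test: the last junction must agree with the first
            # on the label of the edge that joins them.
            if chain[-1][1] == chain[0][0]:
                loops.append(tuple(chain))
            return
        for junction in candidates[i]:
            # Consecutive junctions must agree on their shared edge.
            if i == 0 or chain[-1][1] == junction[0]:
                chain.append(junction)
                extend(chain)
                chain.pop()

    extend([])
    return loops

# A triangular face with two candidate junctions per vertex (made-up data).
candidates = [
    [("+", "+"), ("-", "+")],
    [("+", "-"), ("-", "-")],
    [("-", "+"), ("+", "+")],
]
loops = junction_loops(candidates)
```

In this example only one chain survives: the chain starting with `("-", "+")` can be extended around the boundary but fails the closing test, which is the kind of constraint an open-chain filter alone would not detect.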

It is well known that there are no dangling edges and vertices in a
projection of a polyhedral scene. Each edge has two distinct faces, and
each vertex has as many faces as edges joined to it. So the number of
vertex-edge, edge-face, and face-vertex links is each at most twice the
number of edges, and the number of nodes in the image is at most three
times the number of edges. With a complete search at each node to
enumerate all the local labelings that are node-consistent, we deduce
that:
Figure 2: Junctions, junction-pairs, and junction-loops are
respectively local labelings around a vertex, an edge, and a face.


Complexity 1 The enumeration of all local labelings at all nodes has
time complexity linear in the number of edges in the image and
exponential in the maximum number of edges intersecting at a node.

3.  Arc  Consistency 

Constraints at all nodes are captured in the enumeration of all local
labelings that satisfy node labels. These constraints are propagated
over all links by pruning adjacent local labelings that cannot possibly
match. Let's describe the constraint satisfaction and propagation (CSP)
over vertex-edge-face links, after all the local labelings are
enumerated.

Algorithm 1 (Parallel CSP over vertex-edge-face links)

Initialize Vertex-Fifo, Edge-Fifo, and Face-Fifo respectively with all
vertices, edges, and faces in the image.

Loop until (and (emptyp Vertex-Fifo) (emptyp Edge-Fifo) (emptyp
Face-Fifo)) do

1. Loop until (emptyp Vertex-Fifo) do
   Let V := (pop Vertex-Fifo).

   a. Loop for all edge Ei in Edges(V) do
      Restrict junction-pairs of Ei with junctions of V.
      If the set of junction-pairs of edge Ei has been restricted
      then (insert Ei Edge-Fifo).

   b. Loop for all face Fi in Faces(V) do
      Restrict junction-loops of Fi with junctions of V.
      If the set of junction-loops of face Fi has been restricted
      then (insert Fi Face-Fifo).

2. Loop until (emptyp Edge-Fifo) do
   Let E := (pop Edge-Fifo).

   a. Loop for all face Fi in Faces(E) do
      Restrict junction-loops of Fi with junction-pairs of E.
      If the set of junction-loops of face Fi has been restricted
      then (insert Fi Face-Fifo).

   b. Loop for all vertex Vi in Vertices(E) do
      Restrict junctions of Vi with junction-pairs of E.
      If the set of junctions of vertex Vi has been restricted
      then (insert Vi Vertex-Fifo).

3. Loop until (emptyp Face-Fifo) do
   Let F := (pop Face-Fifo).

   a. Loop for all vertex Vi in Vertices(F) do
      Restrict junctions of Vi with junction-loops of F.
      If the set of junctions of vertex Vi has been restricted
      then (insert Vi Vertex-Fifo).

   b. Loop for all edge Ei in Edges(F) do
      Restrict junction-pairs of Ei with junction-loops of F.
      If the set of junction-pairs of edge Ei has been restricted
      then (insert Ei Edge-Fifo).


Algorithm 2 (Restrict junctions of V with junction-pairs of E)
Loop for all J in Junctions(V) do
  if Loop for all P in Junction-Pairs(E)
       never J ⊆ arcf(P, V)
  then if Loop for all P in Junction-Pairs(E)
            thereis arcf(P, V) ⊆ J
       then replace J by its sub-junctions
       else delete J from Junctions(V).
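The propagation scheme of Algorithms 1 and 2 can be sketched compactly under simplifying assumptions. The sketch below is not the authors' implementation: it uses a single FIFO instead of three, treats the label hierarchy as absent (so the consistency test is a plain compatibility predicate and there is no refinement into sub-junctions), and represents each node's local labelings as a Python set. The names `propagate`, `compatible`, and the toy network are hypothetical.

```python
from collections import deque

def propagate(domains, neighbors, compatible):
    """Prune local labelings until every link is arc consistent.

    domains:    dict node -> set of local labelings (updated in place)
    neighbors:  dict node -> list of adjacent nodes
    compatible: function (node_a, label_a, node_b, label_b) -> bool
    Returns False as soon as some node has no labeling left, else True.
    """
    queue = deque(domains)            # every node starts in the FIFO
    queued = set(domains)
    while queue:
        node = queue.popleft()
        queued.discard(node)
        for other in neighbors[node]:
            # Keep the labelings of `other` supported by a labeling of `node`.
            kept = {b for b in domains[other]
                    if any(compatible(node, a, other, b)
                           for a in domains[node])}
            if kept != domains[other]:
                domains[other] = kept
                if not kept:
                    return False
                if other not in queued:   # a restricted node is revisited
                    queue.append(other)
                    queued.add(other)
    return True

# Toy network: one edge E between vertices V1 and V2. A vertex labeling is
# the label it gives E; an edge labeling is a (label at V1, label at V2) pair.
domains = {
    "V1": {"+", "-"},
    "E": {("+", "+"), ("-", "-"), (">", ">")},
    "V2": {"+", ">"},
}
neighbors = {"V1": ["E"], "E": ["V1", "V2"], "V2": ["E"]}

def compatible(a, label_a, b, label_b):
    """Plays the role of arcf: project the pair onto the vertex's end."""
    if a == "E":
        pair, vertex, vertex_label = label_a, b, label_b
    else:
        pair, vertex, vertex_label = label_b, a, label_a
    end = 0 if vertex == "V1" else 1
    return pair[end] == vertex_label

consistent = propagate(domains, neighbors, compatible)
```

After propagation only the `+` labeling survives at every node of the toy network. A node is re-queued only when its set of labelings shrinks, which is the property the complexity argument below relies on.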

Complexity 2 Let $m_J, m_P, m_L$ be the maximum number of junctions,
junction-pairs, junction-loops at any vertex, edge, face in the image.
Without the hierarchy of edge-labels, the time complexity of constraint
satisfaction and propagation over vertex-edge-face links of a
polyhedral image $I$ is:

$$O\!\left( m_J\, m_P\, m_L \left( \frac{m_P + m_L}{m_J} + \frac{m_L + m_J}{m_P} + \frac{m_J + m_P}{m_L} \right) |Edges(I)| \right)$$

If there is no hierarchy of edge-labels, the consistency relation is =
instead of $\subseteq$, and there is no label refinement. The sets of
labelings have their worst-case cardinality, and Algorithm 2 has time
complexity $O(m_J m_P)$. We note that the number of edge-vertex arcs is
twice the number of edges in the image. Each edge can be in Edge-Fifo
$m_P$ times, so in the worst case the constraint procedure in
Algorithm 2 is called at most $2 m_P |Edges(I)|$ times, and the
edge-vertex arc consistency of Loop 2.b costs
$O(m_P^2\, m_J\, |Edges(I)|)$ time. Similarly, we prove that the other
arc-consistencies have time complexity linear in the number of edges in
the image, and polynomial in the number of local labelings. ∎


4.  Global  Consistency 

We have shown how the constraints are satisfied at each node and
propagated to the neighbors using node and arc consistencies. A node is
locally consistent if and only if it has local labelings that satisfy
node consistency and arc consistencies with neighboring nodes. Let's
now prove that local consistency at all nodes implies global
consistency, or formally:

Theorem 2 After CSP, the polyhedral image $I$ has globally consistent
labelings if and only if all the vertices, edges and faces of $I$ have
locally consistent labelings.

We  show  by  induction  that  we  can  enumerate  all  globally 
consistent  diagrams  of  the  image.  The  proof  will  highlight  the 
necessity  of  local  constraints  captured  by  junction-pairs  and 
junction-loops. 
Figure 3: Labeling a face from its boundary vertices and edges.


Figure 4: Labeling many faces joined by common vertices and edges.


• If the vertices and edges of a face boundary have consistent
labelings, then the face boundary has at least one consistent labeling.
Starting from a vertex or an edge, let's walk clockwise keeping the
face on our right-hand side (Figure 3).

Let's pick a junction $J_1$ at vertex $V_1$ from its set of junctions.
Since $V_1$ is locally consistent with labeling $J_1$, we can pick a
junction-pair $P_1$ for edge $E_1$ such that
$J_1 = arcf(P_1, V_1)$. Next, edge $E_1$ is locally consistent with
labeling $P_1$, so we can pick junction $J_2$ at the next vertex $V_2$
such that $arcf(P_1, V_2) = J_2$. By induction, we find a chain of
junctions and junction-pairs that represents a labeling for the face
boundary.

If the face boundary is open, we could start with $V_1$ being the first
vertex of the boundary. By induction from one vertex/edge to the next,
we automatically get a consistent labeling of the face boundary if
Waltz's filtering has been done. If the face boundary is closed, we
have to check that the first and last vertices $V_1$ and $V_n$ can be
labeled consistently at both ends of edge $E_n$. Specifically, we must
have $arcf(P_n, V_1) = J_1$. This is true if and only if the face has
junction-loops. So Waltz's filtering captures labeling constraints over
arbitrary open chains, whereas junction-loops capture constraints over
both open and closed chains, going around the faces of the image.

• If a face and its neighboring faces, as well as their boundary
vertices and edges, all have consistent labelings, then there exists at
least one consistent labeling for all these faces, edges, and vertices.
Let face $F_1$ be the starting face with junction-loop $L_1$ found from
above. Let's find a face labeling for each neighboring face of $F_1$,
starting from the vertices and edges on the boundary of $F_1$
(Figure 4).

Let's pick junction $J_i$ of vertex $V_i$ such that
$arcf(L_1, V_i) = J_i$. Since vertex $V_i$ is locally consistent, we
can pick the junction-loops $L_{ij}$ for the adjacent faces $F_{ij}$
connecting at $V_i$ such that $J_i = arcf(L_{ij}, V_i)$.

So the labelings for the neighboring faces of $F_1$ exist. However, for
some faces such as $F_4$, we might have different face labelings from
the two vertices $V_1$ and $V_2$. We have to check that there exist two
junction-loops $L_{14}, L_{24}$ having the same labeling at least local
to the vertices $V_1$, $V_2$, and to the edge $E_3$. Equivalently, the
two junction-loops must have the same pair of junctions $(J_1, J_2)$
for $(V_1, V_2)$. This is true if and only if the junction-pair
$P_3 = (J_1, J_2)$ of edge $E_3$ is locally consistent with a
junction-loop $L_4$ of face $F_4$. So a junction-pair is a more
complete description of the local labeling at an edge than an
edge-label.

We have shown that the labeling for all faces joined with $F_1$ exists
and is locally consistent with the face boundary $\partial F_1$. If
these adjacent faces share further edges and vertices, we can use the
previous induction, going from one vertex/edge to the next, starting
from the boundary vertices and edges of $\partial F_1$, to show that
any two joining faces will have two face-labelings that are locally
consistent at all the vertices and edges common to the two faces.

• If all the nodes in the image have consistent labelings, then there
exists at least one global labeling for the whole image. The
segmentation of the image into faces guarantees that labeling all the
faces will ultimately label the whole image. Just like putting together
a puzzle, we label the whole image by fitting the face-labelings with
each other. We use the previous induction to label consistently from
one face to its neighboring faces, covering all the faces of the image.
The induction uses a common junction or junction-pair if two boundaries
share a common vertex or edge. ∎

Since CSP rejects only the inconsistent local labelings at a node, any
globally consistent labeling must be found from the above inductive
enumeration. This means that the network of nodes, with labels and
local labelings attached to each node, can enumerate and so "contains"
all globally consistent labelings.

Finally,  let’s  compare  the  above  findings  with  more  general 
results  from  the  consistency  o'  networks  of  constraints  (Monta- 
nari  1970.  Mackworth  1977.  Premier  1 97S ).  and  from  topology 
(Hocking  and  Young  1901.  Henle  1979).  Recall  that  image  la¬ 
beling  used  to  be  railed  line  labeling,  because  the  core  problem 
is  to  label  all  the  lines  or  edges  in  the  image,  satisfying  all 
the  constraints  at  the  vertices  and  edges.  Junctions  are  used 
to  capture  the  n-ary  constraints  among  edges  intersecting  at  a 
common vertex. A vertex-node is needed to store these junctions.
The n-ary constraints among intersecting edges become
binary constraints between these edges and the common vertex.
Similarly, by creating higher-level K-nodes and enumerating all
the locally consistent labelings for these K-nodes, one can synthesize
more extended K-ary constraints, covering finally all
the edges in the image (Freuder 1979). At the K-th step, the K-ary
constraints become binary constraints between the K-nodes
and the (K-1)-nodes. So, arc consistency algorithms, such as
Waltz’s filter, can be used to ensure K-satisfiability. The automatic
synthesis of the implicit n-ary constraints is exponential,
and is equivalent, in the worst case, to depth-first search over
the whole image. Montanari (1977) showed that making explicit
all the implicit n-ary constraints in an arbitrary network
of unary and binary constraints is equivalent to finding all the
minimal equivalent networks, and is exponential.
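
As an illustration of the arc consistency step mentioned above, here is a minimal sketch in the style of Mackworth's AC-3. The function name `arc_consistency`, the dictionary layout, and the toy constraint are assumptions made for the example and are not taken from the system described in this paper.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Set, Tuple

Node = Hashable
Constraint = Callable[[object, object], bool]


def arc_consistency(
    domains: Dict[Node, Set],
    constraints: Dict[Tuple[Node, Node], Constraint],
) -> bool:
    """Prune values that lack support across some directed arc.

    constraints[(x, y)](vx, vy) is True when value vx at node x is
    compatible with value vy at node y. Domains are modified in place.
    Returns False as soon as a domain becomes empty.
    """
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        check = constraints[(x, y)]
        unsupported = {
            vx for vx in domains[x]
            if not any(check(vx, vy) for vy in domains[y])
        }
        if unsupported:
            domains[x] -= unsupported
            if not domains[x]:
                return False
            # x lost values, so arcs pointing at x must be re-examined.
            queue.extend(
                (z, w) for (z, w) in constraints if w == x and z != y
            )
    return True


# Hypothetical toy case: two nodes must agree on the label of a shared edge.
def same(a, b):
    return a == b


domains = {"v1": {"+", "-", ">"}, "v2": {"+", "<"}}
constraints = {("v1", "v2"): same, ("v2", "v1"): same}
ok = arc_consistency(domains, constraints)
```

After filtering, only the label supported at both nodes survives. The filter removes locally unsupported labels but does not by itself enumerate global labelings.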

Our finding is not in conflict with the above general results
for arbitrary networks. But note that line-drawings and images
are not arbitrary networks. From topology (Hocking and
Young 1961, Henle 1979), it is well known that by introducing
faces, we get not only a planar graph but a complete tessellation
of the plane. It is the tessellation of the plane that allows
global consistency to be deduced from transitive closure of local
consistencies. In other words, the planar topology makes explicit
all the implicit local n-ary constraints, and global search
is never needed. The local n-ary constraints are constraints
between edges that intersect at a vertex, edge, and face. These
n-ary constraints are captured in the enumeration of all locally
consistent volumes around the vertex, edge, and face.

The above parallel labeling algorithm can be extended to
find local/global consistency in a network of nodes embedded
in an arbitrary d-dimensional tessellation. Exploiting the tessellation
of the 3D volume into blocks, Markowsky and Wesley
(1980) fleshed out wire-frames into polyhedra through the enumeration
of all topologically consistent vertices, edges, faces,
and blocks. 2- and 3-dimensional embedded tessellations are
common in VLSI designs, because of the physical constraints
inherent in the layout and connections between the VLSI components.
Parallel CSP can be used, for example, to simulate,
analyze, and debug circuit layouts.


5.  Ideal  Versus  Real  Polyhedral  Images 

Figure 5 shows the labeling of two blocks, one on top of the
other. The face surrounding the two blocks has 16 junction-loops,
corresponding to the blocks floating in air, or resting
against some imaginary surface at some of its bounding edges.
The two interpretations correspond to the face labeled as image
background (B) or as polyhedral face (F). Recognizing from the
16 junction-loops that the surrounding face is a touchable background
reduces the 16 junction-loops to 1 and further restricts
the labeling of the blocks.

The labeling of the jeep, with 122 vertices, 167 edges, and 17
faces, takes about 15 seconds on a Symbolics machine running
compiled code (Figure 6). Note the drastic reduction from a
search space of approximately 3.5 x 10^100 possible labelings
of the whole image (167 edges and each edge has 4 labels) to
a solution space of 96 globally consistent labelings, represented
compactly by one single network of nodes with all locally consistent
labelings.
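
The size of the unpruned search space can be checked directly from the two figures quoted above, 167 edges and 4 labels per edge. The short computation below is only an arithmetic check. The variable names are illustrative.

```python
import math

edges = 167          # edges in the jeep drawing (from the text)
labels_per_edge = 4  # labels available per edge (from the text)

# Every edge is labeled independently, so the count is 4^167 (exact integer).
search_space = labels_per_edge ** edges

exponent = math.floor(math.log10(search_space))
mantissa = search_space / 10 ** exponent
print(f"{mantissa:.2f} x 10^{exponent}")
```

The count is on the order of 10^100, against the 96 globally consistent labelings reported for the jeep.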

The labeling of line drawings and projections of ideal polyhedral
scenes is fast and local. It is true that Waltz’s filtering
has the most drastic reduction of the search space, from 10^80 to
10^10 for the jeep example. However, without the enumeration
of the junction-pairs and junction-loops, and the sufficiency of
local consistency, the labeling of the jeep would run for hours,
and output billions of globally consistent labelings, each
differing from the next at only one edge-label.




Figure  5:  Labeling  of  the  blocks. 


Figure  6:  Labeling  of  the  jeep. 


Labeling of real polyhedral images proves to be less successful
because of unreliable junctions in real segmentations.
Figure 7 shows a 268x200 image of a V-block, the step edges
detected by a Canny edge detector (1986), the segmentation
obtained by extending or eliminating the step edges, and the locally
consistent labels at all the edges. Note that the step contours
are often broken because of thresholding and the fast change
in direction of the gradient near corners. To get closed boundaries,
we either extend the steps looking for zero-crossings of
the second derivative along the gradient direction, or delete
dangling step chains. This heuristic is consistent with the facts
that zero-crossing contours are always closed, and that there
are no dangling edges and vertices in the projection of a polyhedral
scene. However, arbitrary extensions and deletions of
the steps can lead to tiny loops. By pure coincidence, the tiny
loops do not hurt the labeling of the V-block, because they either
are disjoint from the V-block, or touch the latter at T or
? junctions.
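
The deletion half of this heuristic can be sketched as a graph pruning step, assuming the step chains have already been converted into an undirected graph of vertices and edges without self-loops. The function name `prune_dangling` and the toy graph are hypothetical. The extension of steps along zero-crossings is not covered by this sketch.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]


def prune_dangling(edges: Iterable[Edge]) -> Set[frozenset]:
    """Repeatedly drop edges that end at a vertex of degree 1."""
    remaining = {frozenset(e) for e in edges}
    while True:
        degree = defaultdict(int)
        for e in remaining:
            for v in e:
                degree[v] += 1
        dangling = {
            e for e in remaining if any(degree[v] == 1 for v in e)
        }
        if not dangling:
            return remaining
        remaining -= dangling


# Hypothetical toy graph: a closed square with a two-edge chain hanging off c.
square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
chain = [("c", "x"), ("x", "y")]
closed = prune_dangling(square + chain)
```

Only the closed boundary survives, which matches the requirement that the projection of a polyhedral scene has no dangling edges or vertices.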

It is no surprise that such extension and elimination
of the steps could create many wrong junctions, missing/extraneous
edges and faces, and completely fool the labeling,
which is based so much on a locally correct segmentation.
Figure 8 shows a 283x209 image of plastic letters (A,
W, E, N), its segmentation and labeling. To avoid as much
reflection as possible, we have sand-blasted the plastic blocks
to make their surface matte, used a diffuse light source, and
reduced from an original image twice as large. We have kept the
shadows on purpose, although they are not accounted for by the
current label dictionary. As expected, most of the damage to
labeling is caused not by shadows but by wrong local segmentations
and by small fluctuations in the shape of the junctions.
However, it is worth noting that the elimination of global
depth-first search highlights the locality of labeling, and opens
up the possibility of correcting the local segmentation based on
locally consistent labelings. Any corrections should be top-down,
and based on significant faces, edges, and vertices, to
avoid small errors in the local segmentations.